Note: multiple people have worked on this report, and each code execution can produce slightly different numbers in some sections. The figures quoted in the explanations were not updated after being written, so they may differ from the code output by a very small margin.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook_connected"
print("All Libraries Loaded Successfully")
All Libraries Loaded Successfully
df = pd.read_csv("C:/Users/rohan/OneDrive/Desktop/MBA/Term 4/O712/Group Project/O712 Group Project Data - eCommerce Customers.csv")
# Preview the first few rows to confirm the raw data loaded successfully
df.head(7)
| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | VisitorType | Weekend | Transaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.200000 | 0.200000 | 0.0 | 0.0 | Returning_Visitor | False | No |
| 1 | 0 | 0.0 | 0 | 0.0 | 2 | 64.000000 | 0.000000 | 0.100000 | 0.0 | 0.0 | Returning_Visitor | False | No |
| 2 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.200000 | 0.200000 | 0.0 | 0.0 | Returning_Visitor | False | No |
| 3 | 0 | 0.0 | 0 | 0.0 | 2 | 2.666667 | 0.050000 | 0.140000 | 0.0 | 0.0 | Returning_Visitor | False | No |
| 4 | 0 | 0.0 | 0 | 0.0 | 10 | 627.500000 | 0.020000 | 0.050000 | 0.0 | 0.0 | Returning_Visitor | True | No |
| 5 | 0 | 0.0 | 0 | 0.0 | 19 | 154.216667 | 0.015789 | 0.024561 | 0.0 | 0.0 | Returning_Visitor | False | No |
| 6 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.200000 | 0.200000 | 0.0 | 0.4 | Returning_Visitor | False | No |
df.tail(6)
| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | VisitorType | Weekend | Transaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12324 | 0 | 0.0 | 1 | 0.0 | 16 | 503.000000 | 0.000000 | 0.037647 | 0.000000 | 0.0 | Returning_Visitor | False | No |
| 12325 | 3 | 145.0 | 0 | 0.0 | 53 | 1783.791667 | 0.007143 | 0.029031 | 12.241717 | 0.0 | Returning_Visitor | True | No |
| 12326 | 0 | 0.0 | 0 | 0.0 | 5 | 465.750000 | 0.000000 | 0.021333 | 0.000000 | 0.0 | Returning_Visitor | True | No |
| 12327 | 0 | 0.0 | 0 | 0.0 | 6 | 184.250000 | 0.083333 | 0.086667 | 0.000000 | 0.0 | Returning_Visitor | True | No |
| 12328 | 4 | 75.0 | 0 | 0.0 | 15 | 346.000000 | 0.000000 | 0.021053 | 0.000000 | 0.0 | Returning_Visitor | False | No |
| 12329 | 0 | 0.0 | 0 | 0.0 | 3 | 21.250000 | 0.000000 | 0.066667 | 0.000000 | 0.0 | New_Visitor | True | No |
df.shape
(12330, 13)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   Administrative           12330 non-null  int64
 1   Administrative_Duration  12330 non-null  float64
 2   Informational            12330 non-null  int64
 3   Informational_Duration   12330 non-null  float64
 4   ProductRelated           12330 non-null  int64
 5   ProductRelated_Duration  12330 non-null  float64
 6   BounceRates              12330 non-null  float64
 7   ExitRates                12330 non-null  float64
 8   PageValues               12330 non-null  float64
 9   SpecialDay               12330 non-null  float64
 10  VisitorType              12330 non-null  object
 11  Weekend                  12330 non-null  bool
 12  Transaction              12330 non-null  object
dtypes: bool(1), float64(7), int64(3), object(2)
memory usage: 1.1+ MB
The dataset contains 12,330 rows, and every column is fully non-null, which means there is no missing data.
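The missing-data claim can also be checked column by column; a minimal sketch on a toy frame (the real dataset's columns differ):

```python
import pandas as pd

# Toy frame with one missing value per column, standing in for the sessions data
df = pd.DataFrame({"a": [1, 2, None], "b": ["x", None, "z"]})

# Count missing values per column; a fully non-null frame returns all zeros
missing_counts = df.isnull().sum()
print(missing_counts)
```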
df.duplicated().sum()
710
The dataset contains 710 duplicated records.
Given the size of the dataset, it is plausible that different customers exhibit identical behaviour, especially those who spent very little time on Clifford's website.
It is therefore acceptable to proceed with the analysis without deleting these rows, so as not to lose relevant information.
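Before deciding to keep duplicates, it can help to inspect them; a minimal sketch on a toy frame (column names are illustrative, not the dataset's):

```python
import pandas as pd

# Toy frame standing in for the sessions data
df = pd.DataFrame({"pages": [1, 1, 2, 3], "duration": [0.0, 0.0, 64.0, 5.0]})

# keep=False marks every member of a duplicate group, not just the repeats,
# which makes it easy to judge whether identical rows are plausible sessions
dupes = df[df.duplicated(keep=False)]
print(dupes)
```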
df.describe()
| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 12330.000000 | 12330.000000 | 12330.000000 | 12330.000000 | 12330.000000 | 12330.000000 | 12330.000000 | 12330.000000 | 12330.000000 | 12330.000000 |
| mean | 2.315166 | 80.818611 | 0.503569 | 34.472398 | 31.731468 | 1194.746220 | 0.022191 | 0.043073 | 5.889258 | 0.061427 |
| std | 3.321784 | 176.779107 | 1.270156 | 140.749294 | 44.475503 | 1913.669288 | 0.048488 | 0.048597 | 18.568437 | 0.198917 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 7.000000 | 184.137500 | 0.000000 | 0.014286 | 0.000000 | 0.000000 |
| 50% | 1.000000 | 7.500000 | 0.000000 | 0.000000 | 18.000000 | 598.936905 | 0.003112 | 0.025156 | 0.000000 | 0.000000 |
| 75% | 4.000000 | 93.256250 | 0.000000 | 0.000000 | 38.000000 | 1464.157214 | 0.016813 | 0.050000 | 0.000000 | 0.000000 |
| max | 27.000000 | 3398.750000 | 24.000000 | 2549.375000 | 705.000000 | 63973.522230 | 0.200000 | 0.200000 | 361.763742 | 1.000000 |
The summary statistics suggest that all the numerical columns fall within acceptable and plausible ranges.
First, one-hot encoding is applied so that the non-numeric columns can also be displayed graphically.
# Each get_dummies call returns columns in sorted order; .iloc[:,1] keeps the second one
df['VisitorType Status'] = (pd.get_dummies(df['VisitorType'])).iloc[:,1]    # 1 = Returning_Visitor
df['Weekend Status'] = (pd.get_dummies(df['Weekend'])).iloc[:,1]            # 1 = True (weekend)
df['Transaction Status'] = (pd.get_dummies(df['Transaction'])).iloc[:,1]    # 1 = Yes
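The `.iloc[:,1]` indexing relies on `get_dummies` returning columns in sorted order; an equivalent, more explicit encoding (a sketch, not the notebook's code) compares against the label directly:

```python
import pandas as pd

# Toy frame with the same three categorical columns
df = pd.DataFrame({
    "VisitorType": ["Returning_Visitor", "New_Visitor"],
    "Weekend": [False, True],
    "Transaction": ["No", "Yes"],
})

# Explicit comparisons make the 0/1 meaning independent of column ordering
df["VisitorType Status"] = (df["VisitorType"] == "Returning_Visitor").astype(int)
df["Weekend Status"] = df["Weekend"].astype(int)
df["Transaction Status"] = (df["Transaction"] == "Yes").astype(int)
print(df[["VisitorType Status", "Weekend Status", "Transaction Status"]])
```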
df.head()
| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | VisitorType | Weekend | Transaction | VisitorType Status | Weekend Status | Transaction Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.20 | 0.20 | 0.0 | 0.0 | Returning_Visitor | False | No | 1 | 0 | 0 |
| 1 | 0 | 0.0 | 0 | 0.0 | 2 | 64.000000 | 0.00 | 0.10 | 0.0 | 0.0 | Returning_Visitor | False | No | 1 | 0 | 0 |
| 2 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.20 | 0.20 | 0.0 | 0.0 | Returning_Visitor | False | No | 1 | 0 | 0 |
| 3 | 0 | 0.0 | 0 | 0.0 | 2 | 2.666667 | 0.05 | 0.14 | 0.0 | 0.0 | Returning_Visitor | False | No | 1 | 0 | 0 |
| 4 | 0 | 0.0 | 0 | 0.0 | 10 | 627.500000 | 0.02 | 0.05 | 0.0 | 0.0 | Returning_Visitor | True | No | 1 | 1 | 0 |
df.shape
(12330, 16)
df.corr(numeric_only=True)
| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | Weekend | VisitorType Status | Weekend Status | Transaction Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Administrative | 1.000000 | 0.601583 | 0.376850 | 0.255848 | 0.431119 | 0.373939 | -0.223563 | -0.316483 | 0.098990 | -0.094778 | 0.026417 | -0.022884 | 0.026417 | 0.138917 |
| Administrative_Duration | 0.601583 | 1.000000 | 0.302710 | 0.238031 | 0.289087 | 0.355422 | -0.144170 | -0.205798 | 0.067608 | -0.073304 | 0.014990 | -0.022525 | 0.014990 | 0.093587 |
| Informational | 0.376850 | 0.302710 | 1.000000 | 0.618955 | 0.374164 | 0.387505 | -0.116114 | -0.163666 | 0.048632 | -0.048219 | 0.035785 | 0.057399 | 0.035785 | 0.095200 |
| Informational_Duration | 0.255848 | 0.238031 | 0.618955 | 1.000000 | 0.280046 | 0.347364 | -0.074067 | -0.105276 | 0.030861 | -0.030577 | 0.024078 | 0.045501 | 0.024078 | 0.070345 |
| ProductRelated | 0.431119 | 0.289087 | 0.374164 | 0.280046 | 1.000000 | 0.860927 | -0.204578 | -0.292526 | 0.056282 | -0.023958 | 0.016092 | 0.128738 | 0.016092 | 0.158538 |
| ProductRelated_Duration | 0.373939 | 0.355422 | 0.387505 | 0.347364 | 0.860927 | 1.000000 | -0.184541 | -0.251984 | 0.052823 | -0.036380 | 0.007311 | 0.120489 | 0.007311 | 0.152373 |
| BounceRates | -0.223563 | -0.144170 | -0.116114 | -0.074067 | -0.204578 | -0.184541 | 1.000000 | 0.913004 | -0.119386 | 0.072702 | -0.046514 | 0.129908 | -0.046514 | -0.150673 |
| ExitRates | -0.316483 | -0.205798 | -0.163666 | -0.105276 | -0.292526 | -0.251984 | 0.913004 | 1.000000 | -0.174498 | 0.102242 | -0.062587 | 0.171987 | -0.062587 | -0.207071 |
| PageValues | 0.098990 | 0.067608 | 0.048632 | 0.030861 | 0.056282 | 0.052823 | -0.119386 | -0.174498 | 1.000000 | -0.063541 | 0.012002 | -0.115825 | 0.012002 | 0.492569 |
| SpecialDay | -0.094778 | -0.073304 | -0.048219 | -0.030577 | -0.023958 | -0.036380 | 0.072702 | 0.102242 | -0.063541 | 1.000000 | -0.016767 | 0.087123 | -0.016767 | -0.082305 |
| Weekend | 0.026417 | 0.014990 | 0.035785 | 0.024078 | 0.016092 | 0.007311 | -0.046514 | -0.062587 | 0.012002 | -0.016767 | 1.000000 | -0.039444 | 1.000000 | 0.029295 |
| VisitorType Status | -0.022884 | -0.022525 | 0.057399 | 0.045501 | 0.128738 | 0.120489 | 0.129908 | 0.171987 | -0.115825 | 0.087123 | -0.039444 | 1.000000 | -0.039444 | -0.103843 |
| Weekend Status | 0.026417 | 0.014990 | 0.035785 | 0.024078 | 0.016092 | 0.007311 | -0.046514 | -0.062587 | 0.012002 | -0.016767 | 1.000000 | -0.039444 | 1.000000 | 0.029295 |
| Transaction Status | 0.138917 | 0.093587 | 0.095200 | 0.070345 | 0.158538 | 0.152373 | -0.150673 | -0.207071 | 0.492569 | -0.082305 | 0.029295 | -0.103843 | 0.029295 | 1.000000 |
import seaborn as sns
plt.figure(figsize=(10,10))
sns.heatmap(df.corr(numeric_only=True), annot = True, cmap = 'RdBu', fmt = ".2f", vmin = -1)
plt.show()
From this heatmap it is possible to derive several key insights into the data:
There is a strong, expected correlation (0.86) between the number of product page visits and the time spent on them.
It is important to note that, overall, there is a positive (though not especially strong) intercorrelation between
administrative page visits and their duration ↔ informational page visits and their duration ↔ product-related page visits and their duration.
This indicates a tendency for increased engagement in one page category to be associated with increased engagement in others. Such a pattern may reflect a user's comprehensive engagement with the site's content, where a heightened interest in any single aspect of the e-commerce platform could lead to a more extensive browsing behavior overall.
Overall user engagement on the site correlates positively with the likelihood of making a transaction, with different types of content varying in their impact.
→ Engagement with product-related content, both in terms of visit duration (0.15) and frequency (0.16), is most strongly associated with transaction completion.
→ Administrative content follows, where both the number of visits (0.14) and the time spent (0.09) show a positive correlation with purchases, albeit less so than product-related content.
→ Informational content shows the least correlation (0.10 for visits and 0.07 for duration), suggesting that while it contributes to transaction likelihood, it does so to a lesser extent than administrative or product-related engagement.
This hierarchy emphasizes that deeper engagement, particularly with product-related pages, is a significant predictor of e-commerce transactions, and suggests that improving the user experience on administrative and product pages could lead to higher transaction rates.
PageValues has the strongest positive correlation with Transaction Status (0.49).
→ This means the average value of the pages a user visited before completing a transaction is highly predictive of transaction likelihood.
→ It also suggests that visitors who reach, or are drawn to, pages with higher page values are more likely to make a transaction on Clifford's website.
There is a strong, expected correlation (0.91) between bounce and exit rates.
Bounce and exit rates correlate negatively with Transaction Status, as expected.
→ This further indicates that a lack of engagement makes a transaction less likely.
SpecialDay has a very low negative correlation with Transaction Status (-0.08).
→ This suggests that proximity to a special day has a negligible effect on the likelihood of a transaction occurring.
Weekend has no significant correlation with Transaction Status (0.03).
→ This indicates that transactions are roughly as likely to occur on weekends as on weekdays.
VisitorType Status shows a small negative correlation with Transaction Status (-0.10).
→ This hints that whether a visitor is new or returning has a slight effect on transaction likelihood, with returning visitors somewhat less likely to transact.
There is a positive correlation between visitor status (returning) and both product-related engagement and bounce/exit rates.
→ Returning visitors tend to visit more product-related pages and spend more time on them, but also leave the site more often.
Next, we analyse in detail the variables with the strongest correlation (positive or negative) with the most business-relevant variable, Transaction Status, to better understand their distributions. In particular, we will analyse PageValues, ExitRates, and ProductRelated_Duration.
Note: since ExitRates/BounceRates and ProductRelated/ProductRelated_Duration are highly correlated pairs, analysing one variable of each pair should yield insight into both.
fig = px.box(df, y="PageValues")
fig.show()
The boxplot for PageValues on Clifford's e-commerce website indicates a skew towards the lower end of the PageValue spectrum.
The majority of pages have a relatively low value, but there are several outliers indicating a few pages with exceptionally high values.
This pattern suggests that while most visitors engage with pages that have little influence on transactions, a select few pages are highly effective in contributing to the site's overall revenue.
fig = px.box(df, y="ExitRates")
fig.show()
px.histogram(df, x="ExitRates",nbins=100)
The boxplot for the ExitRates on Clifford's e-commerce website demonstrates a concentration of lower values, with the majority of sessions ending with a low exit rate. This suggests that a substantial portion of users navigate through multiple pages before leaving the site.
The right-skewed distribution seen in the histogram reinforces this observation, indicating that while most users exhibit low exit rates, there is a tail of sessions that end after viewing only a few pages.
fig = px.box(df, y="ProductRelated_Duration")
fig.show()
The boxplot for 'ProductRelated_Duration' on Clifford's e-commerce website indicates that the majority of user sessions involve relatively short durations on product-related pages, as evidenced by the box being compressed towards the lower end of the scale.
There is a significant number of outliers, which suggests that there are a few users who spend a substantial amount of time engaged with product content.
df["Transaction"].value_counts()
No     10422
Yes     1908
Name: Transaction, dtype: int64
table0 = pd.pivot_table(df,index=["Transaction"],values="Transaction Status",aggfunc="count")
table0.reset_index(inplace=True)
table0
| | Transaction | Transaction Status |
|---|---|---|
| 0 | No | 10422 |
| 1 | Yes | 1908 |
Of the 12,330 visitors to Clifford's website, only 1908 made a purchase from the eCommerce site in the last year.
table2 = pd.pivot_table(df,index=["Weekend"],values="Transaction Status",aggfunc="count")
table2.reset_index(inplace=True)
table2
| | Weekend | Transaction Status |
|---|---|---|
| 0 | False | 9462 |
| 1 | True | 2868 |
Of all visits to Clifford's eCommerce website, approximately 23% occur during weekends.
Let's see whether there is an interaction between transaction status and weekend visits.
table3 = pd.pivot_table(df,index=["Weekend","Transaction"],values="Transaction Status",aggfunc="count")
table3.reset_index(inplace=True)
table3
| | Weekend | Transaction | Transaction Status |
|---|---|---|---|
| 0 | False | No | 8053 |
| 1 | False | Yes | 1409 |
| 2 | True | No | 2369 |
| 3 | True | Yes | 499 |
fig = px.bar(data_frame=table3,x="Transaction",y="Transaction Status",color="Weekend",barmode="group",text="Transaction Status",
title="Number of Visitors By Transaction Status & Weekend",height=650,
labels={"Transaction Status":"Number of Visitors"})
# fig["data"]
fig["data"][0]["textposition"] = "outside"
fig["data"][1]["textposition"] = "outside"
fig["data"][0]["marker"]["color"] = "#2BB876"
fig["data"][1]["marker"]["color"] = "#9b287b"
# fig["data"]
fig.show()
Of all transactions made on Clifford's website, around 26% are completed during weekends.
Comparing transaction likelihood, roughly 17% of weekend visitors make a purchase (499 of 2868), versus approximately 15% of weekday visitors (1409 of 9462). Visitors are therefore somewhat more inclined to complete transactions on weekends than on weekdays.
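The weekend/weekday conversion rates quoted above can be reproduced directly from the pivot counts; a sketch using the figures in table3:

```python
# Counts taken from table3 above
counts = {
    ("weekday", "No"): 8053, ("weekday", "Yes"): 1409,
    ("weekend", "No"): 2369, ("weekend", "Yes"): 499,
}

def conversion_rate(period):
    # Share of visits in this period that ended in a transaction
    yes = counts[(period, "Yes")]
    no = counts[(period, "No")]
    return yes / (yes + no)

print(f"weekday: {conversion_rate('weekday'):.1%}")  # ~14.9%
print(f"weekend: {conversion_rate('weekend'):.1%}")  # ~17.4%
```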
table1 = pd.pivot_table(df,index=["VisitorType"],values="VisitorType Status",aggfunc="count")
table1.reset_index(inplace=True)
table1
| | VisitorType | VisitorType Status |
|---|---|---|
| 0 | New_Visitor | 1779 |
| 1 | Returning_Visitor | 10551 |
Of all the visitors to Clifford's eCommerce website:
1779 were new visitors
10551 were returning visitors
table4 = pd.pivot_table(df,index=["Transaction","VisitorType"],values="Transaction Status",aggfunc="count")
table4.reset_index(inplace=True)
table4
| | Transaction | VisitorType | Transaction Status |
|---|---|---|---|
| 0 | No | New_Visitor | 1341 |
| 1 | No | Returning_Visitor | 9081 |
| 2 | Yes | New_Visitor | 438 |
| 3 | Yes | Returning_Visitor | 1470 |
fig = px.bar(data_frame=table4,x="VisitorType",y="Transaction Status",color="Transaction",barmode="group",text="Transaction Status",
title="Number of Visitors By Transaction Status & Visitor Type",height=650,
labels={"Transaction Status":"Number of Visitors"})
# fig["data"]
fig["data"][0]["textposition"] = "outside"
fig["data"][1]["textposition"] = "outside"
fig["data"][0]["marker"]["color"] = "#223C50"
fig["data"][1]["marker"]["color"] = "#4CA334"
# fig["data"]
fig.show()
It is worth noting here that:
→ The conversion rate for new visitors on Clifford's website is approximately 24.6% (438 of 1779), meaning roughly 1 in 4 new visitors makes a transaction.
→ The conversion rate for returning visitors is about 13.9% (1470 of 10551), meaning just over 1 in 7 of these visitors completes a transaction.
Despite the higher frequency of visits by returning visitors, new visitors are more likely to make a transaction on Clifford's website.
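These per-type conversion rates can also be derived with `pd.crosstab`; a sketch rebuilt from the counts in table4 (the long frame is a stand-in for the real session-level data):

```python
import pandas as pd

# Rebuild the table4 counts as one row per visitor
rows = (
    [("New_Visitor", "No")] * 1341 + [("New_Visitor", "Yes")] * 438
    + [("Returning_Visitor", "No")] * 9081 + [("Returning_Visitor", "Yes")] * 1470
)
long_df = pd.DataFrame(rows, columns=["VisitorType", "Transaction"])

# normalize="index" turns counts into row-wise proportions, i.e. conversion rates
rates = pd.crosstab(long_df["VisitorType"], long_df["Transaction"], normalize="index")
print(rates)
```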
table5 = pd.pivot_table(df,index=["Transaction","VisitorType","Weekend"],values="Transaction Status",aggfunc="count")
table5.reset_index(inplace=True)
table5
| | Transaction | VisitorType | Weekend | Transaction Status |
|---|---|---|---|---|
| 0 | No | New_Visitor | False | 961 |
| 1 | No | New_Visitor | True | 380 |
| 2 | No | Returning_Visitor | False | 7092 |
| 3 | No | Returning_Visitor | True | 1989 |
| 4 | Yes | New_Visitor | False | 332 |
| 5 | Yes | New_Visitor | True | 106 |
| 6 | Yes | Returning_Visitor | False | 1077 |
| 7 | Yes | Returning_Visitor | True | 393 |
fig = px.bar(data_frame=table5,x="VisitorType",y="Transaction Status",color="Transaction",barmode="group",text="Transaction Status",
title="Number of Visitors By Transaction Status & Visitor Type Differentiated by Weekend",height=650,facet_col="Weekend",
labels={"Transaction Status":"Number of Visitors"})
# fig["data"]
fig["data"][0]["textposition"] = "outside"
fig["data"][1]["textposition"] = "outside"
fig["data"][2]["textposition"] = "outside"
fig["data"][3]["textposition"] = "outside"
fig["data"][0]["marker"]["color"] = "#223C50"
fig["data"][1]["marker"]["color"] = "#223C50"
# fig["data"]
fig.show()
It is evident that:
The conversion rate for new visitors dips slightly on weekends, while the opposite trend is observed for returning visitors, whose conversion rate increases. This variation, while present, remains within an acceptable range, indicating that weekends may present a different dynamic in visitor behaviour.
sns.relplot(data=df,x="SpecialDay",y="ProductRelated_Duration",hue="Transaction",col="Transaction",kind="line")
plt.show()
The graph shows that for both groups (those who did and did not make a purchase), the duration on product-related pages fluctuates but generally trends upward as the special day approaches, with a notable increase in duration for non-transacting visitors as the special day nears 1.0.
For visitors who made a purchase, the duration spent on product-related pages is consistently lower than for those who did not make a purchase. Interestingly, the duration for transacting visitors peaks as the special day approaches 0.8, suggesting a potential initial rush to make purchases as a special day approaches, followed by a decline perhaps due to the completion of their intended purchases.
The peak in duration for non-transacting visitors near the special day implies that these visitors may be browsing more or taking longer to make a decision, which could be due to a variety of factors such as increased options, special deals, or indecision.
sns.relplot(data=df,x="SpecialDay",y="PageValues",hue="Transaction",col="VisitorType",kind="line")
plt.show()
Once again we notice how higher PageValues suggest a greater likelihood of making a transaction.
The key takeaway from this graph is that returning visitors who made a transaction demonstrate consistent engagement throughout the period leading up to a special day, while new transacting visitors' engagement and potential value generation peaks and then falls as the special day approaches.
sns.lmplot(data=df,x="PageValues",y="Transaction Status",col="VisitorType",facet_kws=dict(sharex=False, sharey=False),logistic=True)
plt.show()
The logistic regression analysis shows that the probability of a transaction sharply increases with PageValues for both new and returning visitors, indicating that more valuable content is highly effective at driving purchases.
The threshold at which new visitors' transaction probability levels off is lower than that of returning visitors, suggesting that new visitors may require less interaction with high-value pages to be convinced to make a purchase.
sns.lmplot(data=df,x="PageValues",y="Transaction Status",col="Weekend",facet_kws=dict(sharex=False, sharey=False),logistic=True)
plt.show()
The analysis indicates a robust relationship between PageValues and the likelihood of transactions on Clifford's website, with both weekdays and weekends showing a pronounced increase in transaction probability with higher PageValues.
The leveling off of transaction likelihood occurs at a lower PageValue over weekends, suggesting that visitors may have a stronger purchase intent or respond more to weekend promotions.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook_connected"
print("All Libraries Loaded Successfully")
All Libraries Loaded Successfully
df = pd.read_csv("Clifford_Clean with Duplicates.csv")
df.head()
| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | VisitorType | Weekend | Transaction | VisitorType Status | Weekend Status | Transaction Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.20 | 0.20 | 0.0 | 0.0 | Returning_Visitor | False | No | 1 | 0 | 0 |
| 1 | 0 | 0.0 | 0 | 0.0 | 2 | 64.000000 | 0.00 | 0.10 | 0.0 | 0.0 | Returning_Visitor | False | No | 1 | 0 | 0 |
| 2 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.20 | 0.20 | 0.0 | 0.0 | Returning_Visitor | False | No | 1 | 0 | 0 |
| 3 | 0 | 0.0 | 0 | 0.0 | 2 | 2.666667 | 0.05 | 0.14 | 0.0 | 0.0 | Returning_Visitor | False | No | 1 | 0 | 0 |
| 4 | 0 | 0.0 | 0 | 0.0 | 10 | 627.500000 | 0.02 | 0.05 | 0.0 | 0.0 | Returning_Visitor | True | No | 1 | 1 | 0 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   Administrative           12330 non-null  int64
 1   Administrative_Duration  12330 non-null  float64
 2   Informational            12330 non-null  int64
 3   Informational_Duration   12330 non-null  float64
 4   ProductRelated           12330 non-null  int64
 5   ProductRelated_Duration  12330 non-null  float64
 6   BounceRates              12330 non-null  float64
 7   ExitRates                12330 non-null  float64
 8   PageValues               12330 non-null  float64
 9   SpecialDay               12330 non-null  float64
 10  VisitorType              12330 non-null  object
 11  Weekend                  12330 non-null  bool
 12  Transaction              12330 non-null  object
 13  VisitorType Status       12330 non-null  int64
 14  Weekend Status           12330 non-null  int64
 15  Transaction Status       12330 non-null  int64
dtypes: bool(1), float64(7), int64(6), object(2)
memory usage: 1.4+ MB
plt.figure(figsize=(10,10))
sns.heatmap(df.corr(numeric_only=True), annot = True, cmap = 'RdBu', fmt = ".2f", vmin = -1)
col = ['Administrative', 'Administrative_Duration', 'Informational',
'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration',
'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay', 'VisitorType Status', 'Weekend Status',
'Transaction Status','VisitorType',
'Weekend', 'Transaction',]
df= df[col]
df.head()
| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | VisitorType Status | Weekend Status | Transaction Status | VisitorType | Weekend | Transaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.20 | 0.20 | 0.0 | 0.0 | 1 | 0 | 0 | Returning_Visitor | False | No |
| 1 | 0 | 0.0 | 0 | 0.0 | 2 | 64.000000 | 0.00 | 0.10 | 0.0 | 0.0 | 1 | 0 | 0 | Returning_Visitor | False | No |
| 2 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.20 | 0.20 | 0.0 | 0.0 | 1 | 0 | 0 | Returning_Visitor | False | No |
| 3 | 0 | 0.0 | 0 | 0.0 | 2 | 2.666667 | 0.05 | 0.14 | 0.0 | 0.0 | 1 | 0 | 0 | Returning_Visitor | False | No |
| 4 | 0 | 0.0 | 0 | 0.0 | 10 | 627.500000 | 0.02 | 0.05 | 0.0 | 0.0 | 1 | 1 | 0 | Returning_Visitor | True | No |
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import normalize
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import roc_curve, roc_auc_score,classification_report,confusion_matrix,accuracy_score
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
X = df.iloc[:,0:12]  # the 12 numeric predictor columns
Y = df.iloc[:,12]    # the target: Transaction Status
Xtrain, Xtest, ytrain, ytest = train_test_split(X,Y, test_size = 0.2, random_state = 42)
Xtrain.head()
| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | VisitorType Status | Weekend Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1785 | 0 | 0.0 | 0 | 0.0 | 7 | 95.000000 | 0.014286 | 0.061905 | 0.000000 | 0.0 | 1 | 0 |
| 10407 | 2 | 14.0 | 0 | 0.0 | 81 | 1441.910588 | 0.002469 | 0.013933 | 2.769599 | 0.0 | 1 | 0 |
| 286 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.200000 | 0.200000 | 0.000000 | 0.0 | 1 | 0 |
| 6520 | 5 | 49.2 | 4 | 379.0 | 5 | 74.600000 | 0.000000 | 0.018182 | 8.326728 | 0.0 | 0 | 0 |
| 12251 | 0 | 0.0 | 1 | 5.0 | 9 | 279.000000 | 0.040000 | 0.041667 | 0.000000 | 0.0 | 0 | 1 |
ytrain.head()
1785     0
10407    0
286      0
6520     0
12251    0
Name: Transaction Status, dtype: int64
log_reg_full = LogisticRegression()
log_reg_full.fit(Xtrain, ytrain)
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning:
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
LogisticRegression()
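The convergence warning above can typically be resolved by scaling the features and/or raising `max_iter`, as the warning itself suggests; a sketch on synthetic data (not the notebook's split) using a `Pipeline`:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * [1, 100, 1000, 10]  # wildly different feature scales
y = (X[:, 0] + X[:, 1] / 100 > 0).astype(int)       # linearly separable target

# Scaling inside the pipeline lets lbfgs converge; max_iter raised for good measure
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)
print(model.score(X, y))
```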
ytrain_predicted_full_log = log_reg_full.predict(Xtrain)
ytrain_predicted_prob_full_log = log_reg_full.predict_proba(Xtrain)
ytest_predicted_full_log = log_reg_full.predict(Xtest)
ytest_predicted_prob_full_log = log_reg_full.predict_proba(Xtest)
print("Accuracy Score (train):", accuracy_score(y_pred=ytrain_predicted_full_log,y_true= ytrain))
print("Accuracy Score (test):",accuracy_score(y_pred=ytest_predicted_full_log,y_true= ytest))
Accuracy Score (train): 0.8871654501216545
Accuracy Score (test): 0.8718572587185726
sns.heatmap(confusion_matrix(ytest, ytest_predicted_full_log), annot = True, fmt = 'g')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title("Confusion Matrix for Full Scale Log R Model")
plt.show()
Here,
0 = negative (no transaction) and 1 = positive (transaction made).
True positives = 146 (the model correctly predicted customers who made a transaction)
True negatives = 2004 (the model correctly predicted customers who did not make a transaction)
False positives = 51 (the model predicted a transaction, but the customer did not make one)
False negatives = 265 (the model predicted no transaction, but the customer actually made one)
The high number of false negatives is the main concern: the model over-predicts "no transaction" for customers who actually purchased.
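One common remedy for a high false-negative count is to lower the default 0.5 decision threshold on the predicted probabilities; a sketch on synthetic imbalanced data (variable names are illustrative, not the notebook's):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = (rng.random(500) < 0.15).astype(int)  # imbalanced target, like the ~15% conversion rate
X[y == 1] += 0.8                          # give the positive class some signal

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]

# Default 0.5 threshold vs a lowered one: more positives flagged, fewer false negatives
preds_default = (proba >= 0.5).astype(int)
preds_low = (proba >= 0.3).astype(int)
print(preds_default.sum(), preds_low.sum())
```

Lowering the threshold trades false negatives for false positives; the right cutoff depends on the relative business cost of missing a buyer versus mis-targeting a non-buyer.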
TP_full = confusion_matrix(ytest, ytest_predicted_full_log)[1,1]
TN_full =confusion_matrix(ytest, ytest_predicted_full_log)[0,0]
FP_full = confusion_matrix(ytest, ytest_predicted_full_log)[0,1]
FN_full = confusion_matrix(ytest, ytest_predicted_full_log)[1,0]
(TP_full + TN_full )/ (TP_full + TN_full +FP_full +FN_full)
0.8718572587185726
print(classification_report(ytest, ytest_predicted_full_log))
precision recall f1-score support
0 0.88 0.98 0.93 2055
1 0.74 0.36 0.48 411
accuracy 0.87 2466
macro avg 0.81 0.67 0.70 2466
weighted avg 0.86 0.87 0.85 2466
The recall for YES transactions is very low at just 0.36, worse than a 50-50 classifier.
There is also a significant gap between the precision for YES transactions (0.74) and for NO transactions (0.88).
The F1-score for YES transactions is quite low (0.48).
Overall, this means our model classifies NO transactions very well but performs poorly on YES transactions, i.e. at correctly identifying customers who made a purchase on Clifford's website.
Let's complete the model evaluation for the full-scale logistic regression model.
fpr_full_log,tpr_full_log, threshold_full_log = roc_curve(ytest,log_reg_full.predict_proba(Xtest)[:,1])
import matplotlib.pyplot as plt
plt.plot(fpr_full_log,tpr_full_log,label="FPR vs TPR")
plt.plot([0,1], [0,1], linestyle='--', label='No Skill')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for Full Scale Log Model")
plt.legend()
plt.show()
roc_auc_score(ytest,ytest_predicted_full_log)
0.6652068126520682
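A note on the AUC numbers in this report: `roc_auc_score` is called with the hard 0/1 predictions, which collapses the ROC curve to a single operating point and understates the AUC. A minimal, self-contained sketch of the probability-based version, on synthetic data so it runs independently of the project dataset (all variable names here are stand-ins):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data as a stand-in for the project dataset.
X, y = make_classification(n_samples=2000, weights=[0.85, 0.15], random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=1)
clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)

# Hard labels reduce the AUC to (TPR + TNR) / 2 at the default 0.5 threshold.
auc_labels = roc_auc_score(yte, clf.predict(Xte))
# Probabilities use the full ranking of predictions.
auc_probs = roc_auc_score(yte, clf.predict_proba(Xte)[:, 1])
print(auc_labels, auc_probs)
```

The same pattern applies here: passing `log_reg_full.predict_proba(Xtest)[:,1]` instead of `ytest_predicted_full_log` would give the AUC of the curve plotted above.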
Full Log Model Summary
Overall, the full logistic regression model classifies NO Transaction very well but classifies YES Transaction poorly, i.e. it poorly identifies customers who made a purchase on Clifford's website.
scaler = StandardScaler() # Create a scaler object
scaler.fit(Xtrain) # using the fit function I will provide the base mean and standard deviation to the object
XtrainScaled = scaler.transform(Xtrain) # Now the values for the Xtrain has been scaled
XtestScaled = scaler.transform(Xtest) # Now the values for the Xtest has been scaled
Choosing an arbitrary starting value of k = 2
# Create the KNeighborsClassifier object with n_neighbors = 2
knn_full = KNeighborsClassifier(n_neighbors=2)
knn_full
KNeighborsClassifier(n_neighbors=2)
knn_full.fit(XtrainScaled,ytrain)
KNeighborsClassifier(n_neighbors=2)
ytrainPredicted_knn_full = knn_full.predict(XtrainScaled) # Store the predictions from the model on the train dataset
ytestPredicted_knn_full = knn_full.predict(XtestScaled) # Store the predictions from the model on the test dataset
print("Accuracy Score (train):", accuracy_score(y_pred=ytrainPredicted_knn_full,y_true= ytrain))
print("Accuracy Score (test):",accuracy_score(y_pred=ytestPredicted_knn_full,y_true= ytest))
Accuracy Score (train): 0.9283252230332523 Accuracy Score (test): 0.8718572587185726
Accuracy_dict_knn_full = {
"N":[],
"train_acc" : [],
"test_acc" :[]
}
accuracy_df_knn_full=pd.DataFrame(Accuracy_dict_knn_full)
accuracy_df_knn_full
| N | train_acc | test_acc |
|---|---|---|
for i in range(1,50):
    new_row = []
    knn_full_ = KNeighborsClassifier(n_neighbors=i)
    knn_full_.fit(XtrainScaled, ytrain)
    ytrainPredicted_knn_full__ = knn_full_.predict(XtrainScaled)  # predictions on the train dataset
    ytestPredicted_knn_full__ = knn_full_.predict(XtestScaled)  # predictions on the test dataset
    new_row.append(i)
    new_row.append(accuracy_score(y_true=ytrain, y_pred=ytrainPredicted_knn_full__))
    new_row.append(accuracy_score(y_true=ytest, y_pred=ytestPredicted_knn_full__))
    accuracy_df_knn_full.loc[len(accuracy_df_knn_full)] = new_row
accuracy_df_knn_full.head()
| N | train_acc | test_acc | |
|---|---|---|---|
| 0 | 1.0 | 0.999696 | 0.844282 |
| 1 | 2.0 | 0.928325 | 0.871857 |
| 2 | 3.0 | 0.926399 | 0.873479 |
| 3 | 4.0 | 0.911598 | 0.877129 |
| 4 | 5.0 | 0.914538 | 0.875912 |
fig = px.line(x = accuracy_df_knn_full["N"],y=[accuracy_df_knn_full["train_acc"],accuracy_df_knn_full['test_acc']],labels={"variable":"Accuracy Type","value":"Accuracy"})
fig["data"][0]["name"] ="Train Accuracy"
fig["data"][1]["name"] ="Test Accuracy"
fig.show()
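Rather than reading the best k off the chart by eye, it can be pulled from the accuracy table directly. A small sketch on a toy frame with the same columns as `accuracy_df_knn_full` (the values below are illustrative, not the project's):

```python
import pandas as pd

# Toy stand-in with the same columns as the accuracy table above.
acc = pd.DataFrame({
    "N": [1, 2, 3, 4, 5],
    "train_acc": [0.999, 0.928, 0.926, 0.912, 0.915],
    "test_acc": [0.844, 0.872, 0.873, 0.877, 0.876],
})

# The best k is the one with the highest test accuracy.
best_k = int(acc.loc[acc["test_acc"].idxmax(), "N"])
print(best_k)  # prints 4 for this toy frame
```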
knn_full_final = KNeighborsClassifier(n_neighbors=24)
knn_full_final.fit(XtrainScaled,ytrain)
ytestPredicted_knn_full_final = knn_full_final.predict(XtestScaled)
accuracy_score(y_true=ytest,y_pred=ytestPredicted_knn_full_final)*100
87.87510137875101
confusion_matrix(y_true=ytest,y_pred=ytestPredicted_knn_full_final)
array([[2009, 46],
[ 253, 158]], dtype=int64)
sns.heatmap(confusion_matrix(y_true=ytest,y_pred=ytestPredicted_knn_full_final),annot=True,fmt='g')
plt.xlabel("Predictions")
plt.ylabel("Actual Values")
plt.title("Confusion Matrix for Full KNN Model")
plt.show()
So here,
0 = Negative (No Transaction) and 1 = Positive (Yes Transaction)
True Positives = 158 (the model correctly predicted customers who made the transaction)
True Negatives = 2009 (the model correctly predicted customers who did NOT make the transaction)
False Positives = 46 (the model predicted a YES transaction, but the customer did NOT make one)
False Negatives = 253 (the model predicted NO transaction, but the customer actually made one)
The high number of False Negatives is the main concern: the model over-predicts NO Transaction for customers who actually made a purchase.
TP_knn_full_final = confusion_matrix(y_true=ytest,y_pred=ytestPredicted_knn_full_final)[1,1]
TN_knn_full_final =confusion_matrix(y_true=ytest,y_pred=ytestPredicted_knn_full_final)[0,0]
FP_knn_full_final = confusion_matrix(y_true=ytest,y_pred=ytestPredicted_knn_full_final)[0,1]
FN_knn_full_final = confusion_matrix(y_true=ytest,y_pred=ytestPredicted_knn_full_final)[1,0]
(TP_knn_full_final + TN_knn_full_final )/ (TP_knn_full_final + TN_knn_full_final +FP_knn_full_final +FN_knn_full_final)
0.8787510137875101
print(classification_report(ytest,ytestPredicted_knn_full_final))
precision recall f1-score support
0 0.89 0.98 0.93 2055
1 0.77 0.38 0.51 411
accuracy 0.88 2466
macro avg 0.83 0.68 0.72 2466
weighted avg 0.87 0.88 0.86 2466
The recall for YES Transaction is still very low at just 0.38, worse than a coin-flip classifier.
There is also a significant gap between the precision for YES Transaction (0.77) and for NO Transaction (0.89).
The F1-score for YES Transaction is quite low (0.51).
Overall, the model classifies NO Transaction very well but classifies YES Transaction poorly, i.e. it struggles to correctly identify customers who made a purchase on Clifford's website.
Let's complete the model evaluation for the Full Scale Best-K KNN model.
fpr_knn_full_final,tpr_knn_full_final, threshold_knn_full_final = roc_curve(ytest,knn_full_final.predict_proba(XtestScaled)[:,1])
import matplotlib.pyplot as plt
plt.plot(fpr_knn_full_final,tpr_knn_full_final,label="FPR vs TPR")
plt.plot([0,1], [0,1], linestyle='--', label='No Skill')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for Full Scale BEST-K KNN Model")
plt.legend()
plt.show()
roc_auc_score(ytest,ytestPredicted_knn_full_final)
0.681021897810219
BEST-K Full Scale KNN Model
Overall, the best-k full-scale KNN model classifies NO Transaction very well but classifies YES Transaction poorly, i.e. it poorly identifies customers who made a purchase on Clifford's website.
Develop a fully grown tree
full_dt = DecisionTreeClassifier()
full_dt.fit(Xtrain,ytrain)
full_dt
DecisionTreeClassifier()
yTrainPredicted_full_dt = full_dt.predict(Xtrain)
yTestPredicted_full_dt = full_dt.predict(Xtest)
yTrainPredicted_full_dt.size,yTestPredicted_full_dt.size
(9864, 2466)
print("Accuracy Score (train):", accuracy_score(y_pred=yTrainPredicted_full_dt,y_true= ytrain))
print("Accuracy Score (test):",accuracy_score(y_pred=yTestPredicted_full_dt,y_true= ytest))
Accuracy Score (train): 0.9996958637469586 Accuracy Score (test): 0.8552311435523114
plt.figure(figsize=(20,15))
plot_tree(full_dt)
plt.show()
We now need to generalize / prune the tree.
Find the best depth for the tree.
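Besides scanning `max_depth`, scikit-learn's built-in cost-complexity pruning is another way to generalize a tree. A self-contained sketch on synthetic data (the names `Xtr`/`ytr` below are stand-ins, not the project's `Xtrain`/`ytrain`):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data so the sketch runs on its own.
X, y = make_classification(n_samples=1000, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=1)

# Candidate ccp_alpha values; larger alpha prunes more aggressively.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(Xtr, ytr)

scores = [
    (alpha,
     DecisionTreeClassifier(random_state=1, ccp_alpha=alpha).fit(Xtr, ytr).score(Xte, yte))
    for alpha in path.ccp_alphas[:-1]  # the last alpha prunes down to the root
]
best_alpha, best_score = max(scores, key=lambda t: t[1])
```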
Accuracy_dict_DT = {
"N":[],
"train_acc" : [],
"test_acc" :[]
}
accuracy_df_full_dt=pd.DataFrame(Accuracy_dict_DT)
accuracy_df_full_dt
| N | train_acc | test_acc |
|---|---|---|
for i in range(1,20):
    new_row = []
    dt_full_dt = DecisionTreeClassifier(random_state=1, max_depth=i)
    dt_full_dt.fit(Xtrain, ytrain)
    yTrainPredicted_full_dt_ = dt_full_dt.predict(Xtrain)
    yTestPredicted_full_dt_ = dt_full_dt.predict(Xtest)
    new_row.append(i)
    new_row.append(accuracy_score(y_true=ytrain, y_pred=yTrainPredicted_full_dt_))
    new_row.append(accuracy_score(y_true=ytest, y_pred=yTestPredicted_full_dt_))
    accuracy_df_full_dt.loc[len(accuracy_df_full_dt)] = new_row
accuracy_df_full_dt.head()
| N | train_acc | test_acc | |
|---|---|---|---|
| 0 | 1.0 | 0.876723 | 0.872263 |
| 1 | 2.0 | 0.893958 | 0.876318 |
| 2 | 3.0 | 0.896796 | 0.879157 |
| 3 | 4.0 | 0.903893 | 0.881995 |
| 4 | 5.0 | 0.906427 | 0.880373 |
import plotly.express as px
fig = px.line(x = accuracy_df_full_dt["N"],y=[accuracy_df_full_dt["train_acc"],accuracy_df_full_dt['test_acc']])
fig["data"][0]["name"] ="Train Accuracy"
fig["data"][1]["name"] ="Test Accuracy"
fig.show()
Let's train the best-depth full tree and evaluate the model.
dt_full_final = DecisionTreeClassifier(random_state=1,max_depth=4)
dt_full_final.fit(Xtrain,ytrain)
yTrainPredicted_dt_full_final = dt_full_final.predict(Xtrain)
yTestPredicted_dt_full_final = dt_full_final.predict(Xtest)
plt.figure(figsize=(50,25))
plot_tree(dt_full_final,feature_names=list(Xtrain.columns),filled=True,class_names=["NO Transaction","YES Transaction"],fontsize=18)
plt.show()
confusion_matrix(y_true=ytest,y_pred=yTestPredicted_dt_full_final)
array([[1972, 83],
[ 208, 203]], dtype=int64)
sns.heatmap(confusion_matrix(y_true=ytest,y_pred=yTestPredicted_dt_full_final),annot=True,fmt="g")
plt.xlabel("Prediction Values")
plt.ylabel("Actual Values")
plt.title("Confusion Matrix for Full Decision Tree")
plt.show()
So here,
0 = Negative (No Transaction) and 1 = Positive (Yes Transaction)
True Positives = 203 (the model correctly predicted customers who made the transaction)
True Negatives = 1972 (the model correctly predicted customers who did NOT make the transaction)
False Positives = 83 (the model predicted a YES transaction, but the customer did NOT make one)
False Negatives = 208 (the model predicted NO transaction, but the customer actually made one)
The high number of False Negatives is the main concern: the model over-predicts NO Transaction for customers who actually made a purchase.
TP_full_dt = confusion_matrix(y_true=ytest,y_pred=yTestPredicted_dt_full_final)[1,1]
TN_full_dt =confusion_matrix(y_true=ytest,y_pred=yTestPredicted_dt_full_final)[0,0]
FP_full_dt = confusion_matrix(y_true=ytest,y_pred=yTestPredicted_dt_full_final)[0,1]
FN_full_dt = confusion_matrix(y_true=ytest,y_pred=yTestPredicted_dt_full_final)[1,0]
(TP_full_dt + TN_full_dt )/ (TP_full_dt + TN_full_dt +FP_full_dt +FN_full_dt)
0.8819951338199513
print(classification_report(y_true=ytest,y_pred=yTestPredicted_dt_full_final))
precision recall f1-score support
0 0.90 0.96 0.93 2055
1 0.71 0.49 0.58 411
accuracy 0.88 2466
macro avg 0.81 0.73 0.76 2466
weighted avg 0.87 0.88 0.87 2466
Even the full DT cannot classify YES Transaction with both high precision and high recall.
This tree needs to be improved.
The recall for YES Transaction is low at 0.49, roughly on par with a coin-flip classifier.
There is also a significant gap between the precision for YES Transaction (0.71) and for NO Transaction (0.90).
The F1-score for YES Transaction is quite low (0.58).
Overall, the model classifies NO Transaction very well but classifies YES Transaction poorly, i.e. it struggles to correctly identify customers who made a purchase on Clifford's website.
Let's complete the model evaluation for the Full Scale Best-Depth decision tree.
dt_full_final.feature_importances_
array([0.03170617, 0.00592976, 0. , 0.00097663, 0.0093989 ,
0.00875109, 0.08719713, 0.00752461, 0.8485157 , 0. ,
0. , 0. ])
pd.DataFrame({
"Feature":Xtrain.columns,
"Importance":dt_full_final.feature_importances_
})
| Feature | Importance | |
|---|---|---|
| 0 | Administrative | 0.031706 |
| 1 | Administrative_Duration | 0.005930 |
| 2 | Informational | 0.000000 |
| 3 | Informational_Duration | 0.000977 |
| 4 | ProductRelated | 0.009399 |
| 5 | ProductRelated_Duration | 0.008751 |
| 6 | BounceRates | 0.087197 |
| 7 | ExitRates | 0.007525 |
| 8 | PageValues | 0.848516 |
| 9 | SpecialDay | 0.000000 |
| 10 | VisitorType Status | 0.000000 |
| 11 | Weekend Status | 0.000000 |
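The table confirms that PageValues dominates the splits. The importances read more easily when sorted; a small sketch reusing the values above (importances from a fitted tree always sum to 1, so each value reads as a share of the splits):

```python
import pandas as pd

imp = pd.DataFrame({
    "Feature": ["Administrative", "Administrative_Duration", "Informational",
                "Informational_Duration", "ProductRelated", "ProductRelated_Duration",
                "BounceRates", "ExitRates", "PageValues", "SpecialDay",
                "VisitorType Status", "Weekend Status"],
    "Importance": [0.031706, 0.005930, 0.000000, 0.000977, 0.009399, 0.008751,
                   0.087197, 0.007525, 0.848516, 0.000000, 0.000000, 0.000000],
}).sort_values("Importance", ascending=False)

top3 = imp.head(3)["Feature"].tolist()
print(top3)  # ['PageValues', 'BounceRates', 'Administrative']
```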
fpr_full_dt,tpr_full_dt, threshold_full_dt = roc_curve(ytest,dt_full_final.predict_proba(Xtest)[:,1])
plt.plot(fpr_full_dt,tpr_full_dt,label="FPR vs TPR")
plt.plot([0,1], [0,1], linestyle='--', label='No Skill')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for Full Scale Decision Tree")
plt.legend()
plt.show()
roc_auc_score(ytest,yTestPredicted_dt_full_final)
0.7267639902676399
Full Scale BEST-Depth Decision Tree Model
Overall, the best-depth full-scale decision tree classifies NO Transaction very well but classifies YES Transaction poorly, i.e. it poorly identifies customers who made a purchase on Clifford's website.
| Comparison | Full LogR | Full Scale BEST-K KNN | Full Scale BEST-Depth DT |
|---|---|---|---|
| Model Accuracy | 87.18% | 87.87% | 88.19% |
| False Negatives | 265 | 253 | 208 |
| False Positives | 51 | 46 | 83 |
| Precision (NO) | 0.88 | 0.89 | 0.90 |
| Recall (NO) | 0.98 | 0.98 | 0.96 |
| F1-Score (NO) | 0.93 | 0.93 | 0.93 |
| Precision (YES) | 0.74 | 0.77 | 0.71 |
| Recall (YES) | 0.36 | 0.38 | 0.49 |
| F1-Score (YES) | 0.48 | 0.51 | 0.58 |
| AUC Score | 0.665 | 0.685 | 0.726 |
Overall, the full-scale best-depth decision tree has an edge over the other predictive models.
Pin 1: Full-scale model
Impact: takes into account non-contributing variables with minimal impact on the target variable, skewing the model results.
Resolution: remove non-significant variables, i.e. those with multicollinearity or low significance in the logistic regression.
Pin 1.1: Re-train the model with the best threshold
Impact & Resolution: would accept more YES Transactions, boosting recall and F1-score for the YES class.
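One common way to pick such a threshold is Youden's J statistic (maximizing TPR - FPR) on the ROC curve. A self-contained sketch on synthetic data; with the project's model, the class-1 probabilities from `predict_proba` would take the place of `probs` here:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in data.
X, y = make_classification(n_samples=2000, weights=[0.85, 0.15], random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=1)
clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)

probs = clf.predict_proba(Xte)[:, 1]
fpr, tpr, thresholds = roc_curve(yte, probs)

best = np.argmax(tpr - fpr)          # Youden's J = TPR - FPR
best_threshold = thresholds[best]

# Re-label with the tuned threshold instead of the default 0.5.
y_pred_tuned = (probs >= best_threshold).astype(int)
```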
Pin 2: Imbalance in the training dataset
Impact: the model over-trains on NO Transaction and over-predicts it, while under-training on YES Transaction and under-predicting it, leading to low precision, recall and F1-score for YES Transaction.
Resolution: train the model on a 50%-50% split of YES and NO Transaction records.
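Short of rebalancing the records themselves, scikit-learn's `class_weight="balanced"` option achieves a similar effect by up-weighting the minority class during training. A sketch on synthetic data with roughly the same imbalance as the project target:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly an 85/15 class imbalance.
X, y = make_classification(n_samples=2000, weights=[0.85, 0.15], random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=1)

plain = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(Xtr, ytr)

# Up-weighting the minority class usually trades some precision
# for a noticeably better recall on the YES class.
recall_plain = recall_score(yte, plain.predict(Xte))
recall_balanced = recall_score(yte, balanced.predict(Xte))
```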
log_reg_model = sm.GLM(ytrain, sm.add_constant(Xtrain), family=sm.families.Binomial()).fit()
log_reg_model.summary()
| Dep. Variable: | Transaction Status | No. Observations: | 9864 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 9851 |
| Model Family: | Binomial | Df Model: | 12 |
| Link Function: | Logit | Scale: | 1.0000 |
| Method: | IRLS | Log-Likelihood: | -2915.8 |
| Date: | Tue, 05 Dec 2023 | Deviance: | 5831.7 |
| Time: | 22:47:23 | Pearson chi2: | 4.35e+06 |
| No. Iterations: | 8 | Pseudo R-squ. (CS): | 0.2292 |
| Covariance Type: | nonrobust |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | -1.8175 | 0.102 | -17.841 | 0.000 | -2.017 | -1.618 |
| Administrative | -0.0015 | 0.012 | -0.121 | 0.903 | -0.026 | 0.023 |
| Administrative_Duration | -0.0002 | 0.000 | -1.017 | 0.309 | -0.001 | 0.000 |
| Informational | 0.0645 | 0.029 | 2.197 | 0.028 | 0.007 | 0.122 |
| Informational_Duration | -0.0002 | 0.000 | -0.723 | 0.470 | -0.001 | 0.000 |
| ProductRelated | 0.0032 | 0.001 | 2.533 | 0.011 | 0.001 | 0.006 |
| ProductRelated_Duration | 5.873e-05 | 2.98e-05 | 1.973 | 0.048 | 3.97e-07 | 0.000 |
| BounceRates | -1.4879 | 3.542 | -0.420 | 0.674 | -8.430 | 5.454 |
| ExitRates | -16.6350 | 2.686 | -6.193 | 0.000 | -21.899 | -11.371 |
| PageValues | 0.0802 | 0.003 | 30.517 | 0.000 | 0.075 | 0.085 |
| SpecialDay | -1.1722 | 0.271 | -4.328 | 0.000 | -1.703 | -0.641 |
| VisitorType Status | -0.3353 | 0.094 | -3.575 | 0.000 | -0.519 | -0.151 |
| Weekend Status | 0.1361 | 0.079 | 1.716 | 0.086 | -0.019 | 0.292 |
The variables ['Administrative','Informational_Duration','Administrative_Duration', 'BounceRates',"Weekend Status"] have very low significance (p-value > 0.05).
XtrainR = Xtrain.drop(columns = ['Administrative','Informational_Duration','Administrative_Duration', 'BounceRates',"Weekend Status"])
XtestR = Xtest.drop(columns = ['Administrative','Informational_Duration','Administrative_Duration', 'BounceRates',"Weekend Status"])
XtrainR.head()
| Informational | ProductRelated | ProductRelated_Duration | ExitRates | PageValues | SpecialDay | VisitorType Status | |
|---|---|---|---|---|---|---|---|
| 1785 | 0 | 7 | 95.000000 | 0.061905 | 0.000000 | 0.0 | 1 |
| 10407 | 0 | 81 | 1441.910588 | 0.013933 | 2.769599 | 0.0 | 1 |
| 286 | 0 | 1 | 0.000000 | 0.200000 | 0.000000 | 0.0 | 1 |
| 6520 | 4 | 5 | 74.600000 | 0.018182 | 8.326728 | 0.0 | 0 |
| 12251 | 1 | 9 | 279.000000 | 0.041667 | 0.000000 | 0.0 | 0 |
XtestR.head()
| Informational | ProductRelated | ProductRelated_Duration | ExitRates | PageValues | SpecialDay | VisitorType Status | |
|---|---|---|---|---|---|---|---|
| 8916 | 0 | 48 | 1052.255952 | 0.013043 | 0.000000 | 0.0 | 1 |
| 772 | 2 | 83 | 2503.881781 | 0.004916 | 2.086218 | 0.0 | 1 |
| 12250 | 0 | 126 | 4310.004668 | 0.012823 | 3.451072 | 0.0 | 1 |
| 7793 | 0 | 10 | 606.666667 | 0.026389 | 36.672294 | 0.0 | 1 |
| 6601 | 6 | 168 | 4948.398759 | 0.013528 | 10.150644 | 0.0 | 1 |
reduced_logR = LogisticRegression()
reduced_logR.fit(XtrainR, ytrain)
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning:
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
LogisticRegression()
ytrain_predicted_reduced_logR = reduced_logR.predict(XtrainR)
ytrain_predicted_prob_reduced_logR = reduced_logR.predict_proba(XtrainR)
ytest_predicted_reduced_logR = reduced_logR.predict(XtestR)
ytest_predicted_prob_reduced_logR = reduced_logR.predict_proba(XtestR)
print("Accuracy Score (train):", accuracy_score(y_pred=ytrain_predicted_reduced_logR,y_true= ytrain))
print("Accuracy Score (test):",accuracy_score(y_pred=ytest_predicted_reduced_logR,y_true= ytest))
Accuracy Score (train): 0.889294403892944 Accuracy Score (test): 0.8690186536901865
sns.heatmap(confusion_matrix(ytest, ytest_predicted_reduced_logR), annot = True, fmt = 'g')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title("Confusion Matrix for Reduced Log R Model")
plt.show()
So here,
0 = Negative (No Transaction) and 1 = Positive (Yes Transaction)
True Positives = 149 (the model correctly predicted customers who made the transaction)
True Negatives = 2000 (the model correctly predicted customers who did NOT make the transaction)
False Positives = 55 (the model predicted a YES transaction, but the customer did NOT make one)
False Negatives = 262 (the model predicted NO transaction, but the customer actually made one)
The high number of False Negatives is the main concern: the model over-predicts NO Transaction for customers who actually made a purchase.
TP_reduced_logR = confusion_matrix(ytest, ytest_predicted_reduced_logR)[1,1]
TN_reduced_logR =confusion_matrix(ytest, ytest_predicted_reduced_logR)[0,0]
FP_reduced_logR = confusion_matrix(ytest, ytest_predicted_reduced_logR)[0,1]
FN_reduced_logR = confusion_matrix(ytest, ytest_predicted_reduced_logR)[1,0]
(TP_reduced_logR + TN_reduced_logR )/ (TP_reduced_logR + TN_reduced_logR +FP_reduced_logR +FN_reduced_logR)
0.8690186536901865
print(classification_report(ytest, ytest_predicted_reduced_logR))
precision recall f1-score support
0 0.88 0.97 0.93 2055
1 0.72 0.35 0.47 411
accuracy 0.87 2466
macro avg 0.80 0.66 0.70 2466
weighted avg 0.86 0.87 0.85 2466
The recall for YES Transaction is still very low at just 0.35, worse than a coin-flip classifier.
There is also still a significant gap between the precision for YES Transaction (0.72) and for NO Transaction (0.88).
The F1-score for YES Transaction is still quite low (0.47).
Overall, the model classifies NO Transaction very well but classifies YES Transaction poorly, i.e. it struggles to correctly identify customers who made a purchase on Clifford's website.
Let's complete the model evaluation for the Reduced LogR model.
fpr_reduced_logR,tpr_reduced_logR, threshold_reduced_logR = roc_curve(ytest,reduced_logR.predict_proba(XtestR)[:,1])
import matplotlib.pyplot as plt
plt.plot(fpr_reduced_logR,tpr_reduced_logR,label="FPR vs TPR")
plt.plot([0,1], [0,1], linestyle='--', label='No Skill')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for Reduced Log Model")
plt.legend()
plt.show()
roc_auc_score(ytest,ytest_predicted_reduced_logR)
0.6605839416058393
Reduced Log Model Summary
Overall, the reduced model classifies NO Transaction very well but classifies YES Transaction poorly, i.e. it poorly identifies customers who made a purchase on Clifford's website.
| Comparison | Full LogR | Reduced LogR |
|---|---|---|
| Model Accuracy | 87.18% | 87.14% |
| False Negatives | 265 | 262 |
| False Positives | 51 | 55 |
| Precision (NO) | 0.88 | 0.88 |
| Recall (NO) | 0.98 | 0.97 |
| F1-Score (NO) | 0.93 | 0.93 |
| Precision (YES) | 0.74 | 0.73 |
| Recall (YES) | 0.36 | 0.36 |
| F1-Score (YES) | 0.48 | 0.48 |
| AUC Score | 0.665 | 0.667 |
The full-scale and reduced LogR models behave very similarly; there is not much difference between them.
However, the reduced LogR model maintains the same precision, recall and F1-score while slightly reducing false negatives and improving AUC compared with the full-scale model.
scaler_knn = StandardScaler() # Create a scaler object
scaler_knn.fit(XtrainR) # using the fit function I will provide the base mean and standard deviation to the object
XtrainScaledR = scaler_knn.transform(XtrainR) # Now the values for the XtrainR has been scaled
XtestScaledR = scaler_knn.transform(XtestR) # Now the values for the XtestR has been scaled
Choosing an arbitrary starting value of k = 2
# Create the KNeighborsClassifier object with n_neighbors = 2
reduced_knn = KNeighborsClassifier(n_neighbors=2)
reduced_knn
KNeighborsClassifier(n_neighbors=2)
reduced_knn.fit(XtrainScaledR,ytrain)
KNeighborsClassifier(n_neighbors=2)
ytrainPredicted_reduced_knn = reduced_knn.predict(XtrainScaledR) # Store the predictions from the model on the train dataset
ytestPredicted_reduced_knn = reduced_knn.predict(XtestScaledR) # Store the predictions from the model on the test dataset
print("Accuracy Score (train):", accuracy_score(y_pred=ytrainPredicted_reduced_knn,y_true= ytrain))
print("Accuracy Score (test):",accuracy_score(y_pred=ytestPredicted_reduced_knn,y_true= ytest))
Accuracy Score (train): 0.9274128142741281 Accuracy Score (test): 0.8653690186536902
Accuracy_dict_reduced_knn = {
"N":[],
"train_acc" : [],
"test_acc" :[]
}
accuracy_df_reduced_knn=pd.DataFrame(Accuracy_dict_reduced_knn)
accuracy_df_reduced_knn
| N | train_acc | test_acc |
|---|---|---|
for i in range(1,50):
    new_row = []
    reduced_knn_ = KNeighborsClassifier(n_neighbors=i)
    reduced_knn_.fit(XtrainScaledR, ytrain)  # fit on the reduced (scaled) training data
    ytrainPredicted_reduced_knn_ = reduced_knn_.predict(XtrainScaledR)  # predictions on the train dataset
    ytestPredicted_reduced_knn_ = reduced_knn_.predict(XtestScaledR)  # predictions on the test dataset
    new_row.append(i)
    new_row.append(accuracy_score(y_true=ytrain, y_pred=ytrainPredicted_reduced_knn_))
    new_row.append(accuracy_score(y_true=ytest, y_pred=ytestPredicted_reduced_knn_))
    accuracy_df_reduced_knn.loc[len(accuracy_df_reduced_knn)] = new_row
accuracy_df_reduced_knn.head()
| N | train_acc | test_acc | |
|---|---|---|---|
| 0 | 1.0 | 0.999696 | 0.844282 |
| 1 | 2.0 | 0.928325 | 0.871857 |
| 2 | 3.0 | 0.926399 | 0.873479 |
| 3 | 4.0 | 0.911598 | 0.877129 |
| 4 | 5.0 | 0.914538 | 0.875912 |
fig = px.line(x = accuracy_df_reduced_knn["N"],y=[accuracy_df_reduced_knn["train_acc"],accuracy_df_reduced_knn['test_acc']],labels={"variable":"Accuracy Type","value":"Accuracy"})
fig["data"][0]["name"] ="Train Accuracy"
fig["data"][1]["name"] ="Test Accuracy"
fig.show()
reduced_knn_final = KNeighborsClassifier(n_neighbors=11)
reduced_knn_final.fit(XtrainScaledR,ytrain)
ytestPredicted_reduced_knn_final = reduced_knn_final.predict(XtestScaledR)
accuracy_score(y_true=ytest,y_pred=ytestPredicted_reduced_knn_final)*100
87.87510137875101
confusion_matrix(y_true=ytest,y_pred=ytestPredicted_reduced_knn_final)
array([[1973, 82],
[ 217, 194]], dtype=int64)
sns.heatmap(confusion_matrix(y_true=ytest,y_pred=ytestPredicted_reduced_knn_final),annot=True,fmt='g')
plt.xlabel("Predictions")
plt.ylabel("Actual Values")
plt.title("Confusion Matrix for Reduced KNN Model")
plt.show()
So here,
0 = Negative (No Transaction) and 1 = Positive (Yes Transaction)
True Positives = 194 (the model correctly predicted customers who made the transaction)
True Negatives = 1973 (the model correctly predicted customers who did NOT make the transaction)
False Positives = 82 (the model predicted a YES transaction, but the customer did NOT make one)
False Negatives = 217 (the model predicted NO transaction, but the customer actually made one)
The high number of False Negatives is the main concern: the model over-predicts NO Transaction for customers who actually made a purchase.
TP_reduced_knn_final = confusion_matrix(y_true=ytest,y_pred=ytestPredicted_reduced_knn_final)[1,1]
TN_reduced_knn_final =confusion_matrix(y_true=ytest,y_pred=ytestPredicted_reduced_knn_final)[0,0]
FP_reduced_knn_final = confusion_matrix(y_true=ytest,y_pred=ytestPredicted_reduced_knn_final)[0,1]
FN_reduced_knn_final = confusion_matrix(y_true=ytest,y_pred=ytestPredicted_reduced_knn_final)[1,0]
(TP_reduced_knn_final + TN_reduced_knn_final )/ (TP_reduced_knn_final + TN_knn_full_final +FP_reduced_knn_final +FN_reduced_knn_final)
0.8661071143085531
print(classification_report(ytest,ytestPredicted_reduced_knn_final))
precision recall f1-score support
0 0.90 0.96 0.93 2055
1 0.70 0.47 0.56 411
accuracy 0.88 2466
macro avg 0.80 0.72 0.75 2466
weighted avg 0.87 0.88 0.87 2466
The recall for YES Transaction is still low at 0.47, roughly on par with a coin-flip classifier.
There is also still a significant gap between the precision for YES Transaction (0.70) and for NO Transaction (0.90).
The F1-score for YES Transaction is still low (0.56).
Overall, the model classifies NO Transaction very well but classifies YES Transaction poorly, i.e. it struggles to correctly identify customers who made a purchase on Clifford's website.
Still, recall and F1-score have improved slightly compared with the full model.
Let's complete the model evaluation for the Reduced Best-K KNN model.
fpr_reduced_knn_final,tpr_reduced_knn_final, threshold_reduced_knn_final = roc_curve(ytest,reduced_knn_final.predict_proba(XtestScaledR)[:,1])
import matplotlib.pyplot as plt
plt.plot(fpr_reduced_knn_final,tpr_reduced_knn_final,label="FPR vs TPR")
plt.plot([0,1], [0,1], linestyle='--', label='No Skill')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for Reduced BEST-K KNN Model")
plt.legend()
plt.show()
roc_auc_score(ytest,ytestPredicted_reduced_knn_final)
0.7160583941605839
BEST-K Reduced KNN Model
Overall, the best-k reduced KNN model classifies NO Transaction very well but classifies YES Transaction poorly, i.e. it poorly identifies customers who made a purchase on Clifford's website.
| Comparison | Full KNN | Reduced KNN |
|---|---|---|
| Model Accuracy | 87.87% | 87.87% |
| False Negatives | 253 | 217 |
| False Positives | 46 | 82 |
| Precision (NO) | 0.89 | 0.90 |
| Recall (NO) | 0.98 | 0.96 |
| F1-Score (NO) | 0.93 | 0.93 |
| Precision (YES) | 0.77 | 0.70 |
| Recall (YES) | 0.38 | 0.47 |
| F1-Score (YES) | 0.51 | 0.56 |
| AUC Score | 0.681 | 0.716 |
In a nutshell,
The full-scale and reduced KNN models behave quite similarly, but the reduced KNN is slightly better.
The reduced KNN improves on the full KNN in terms of:
→ fewer false negatives
→ higher Recall (YES)
→ higher F1-Score (YES)
→ higher Precision (NO)
→ higher AUC
Develop a fully grown tree on the reduced data
reduced_dt = DecisionTreeClassifier()
reduced_dt.fit(XtrainR,ytrain)
reduced_dt
DecisionTreeClassifier()
yTrainPredicted_reduced_dt = reduced_dt.predict(XtrainR)
yTestPredicted_reduced_dt = reduced_dt.predict(XtestR)
yTrainPredicted_reduced_dt.size,yTestPredicted_reduced_dt.size
(9864, 2466)
print("Accuracy Score (train):", accuracy_score(y_pred=yTrainPredicted_reduced_dt,y_true= ytrain))
print("Accuracy Score (test):",accuracy_score(y_pred=yTestPredicted_reduced_dt,y_true= ytest))
Accuracy Score (train): 0.9991889699918897 Accuracy Score (test): 0.8491484184914841
plt.figure(figsize=(20,15))
plot_tree(reduced_dt)
plt.show()
We now need to generalize / prune the tree
Find the perfect depth of the tree.
Accuracy_dict_reduced_dt = {
"N":[],
"train_acc" : [],
"test_acc" :[]
}
accuracy_df_reduced_dt=pd.DataFrame(Accuracy_dict_reduced_dt)
accuracy_df_reduced_dt
| N | train_acc | test_acc |
|---|---|---|
for i in range(1,20):
    new_row = []
    dt_reduced_dt = DecisionTreeClassifier(random_state=1, max_depth=i)
    dt_reduced_dt.fit(XtrainR, ytrain)  # fit on the reduced training data
    yTrainPredicted_reduced_dt = dt_reduced_dt.predict(XtrainR)
    yTestPredicted_reduced_dt = dt_reduced_dt.predict(XtestR)
    new_row.append(i)
    new_row.append(accuracy_score(y_true=ytrain, y_pred=yTrainPredicted_reduced_dt))
    new_row.append(accuracy_score(y_true=ytest, y_pred=yTestPredicted_reduced_dt))
    accuracy_df_reduced_dt.loc[len(accuracy_df_reduced_dt)] = new_row
accuracy_df_reduced_dt.head()
| N | train_acc | test_acc | |
|---|---|---|---|
| 0 | 1.0 | 0.876723 | 0.872263 |
| 1 | 2.0 | 0.893958 | 0.876318 |
| 2 | 3.0 | 0.896796 | 0.879157 |
| 3 | 4.0 | 0.903893 | 0.881995 |
| 4 | 5.0 | 0.906427 | 0.880373 |
import plotly.express as px
fig = px.line(x = accuracy_df_reduced_dt["N"],y=[accuracy_df_reduced_dt["train_acc"],accuracy_df_reduced_dt['test_acc']])
fig["data"][0]["name"] ="Train Accuracy"
fig["data"][1]["name"] ="Test Accuracy"
fig.show()
Let's train the best-depth reduced tree and evaluate the model.
dt_reduced_dt = DecisionTreeClassifier(random_state=1,max_depth=7)
dt_reduced_dt.fit(XtrainR,ytrain)
yTrainPredicted_dt_reduced_dt = dt_reduced_dt.predict(XtrainR)
yTestPredicted_dt_reduced_dt = dt_reduced_dt.predict(XtestR)
plt.figure(figsize=(50,25))
plot_tree(dt_reduced_dt,feature_names=list(XtrainR.columns),filled=True,class_names=["NO Transaction","YES Transaction"],fontsize=18)
plt.show()
confusion_matrix(y_true=ytest,y_pred=yTestPredicted_dt_reduced_dt)
array([[1977, 78],
[ 197, 214]], dtype=int64)
sns.heatmap(confusion_matrix(y_true=ytest,y_pred=yTestPredicted_dt_reduced_dt),annot=True,fmt="g")
plt.xlabel("Prediction Values")
plt.ylabel("Actual Values")
plt.title("Confusion Matrix for Reduced Decision Tree")
plt.show()
So here,
0 = Negative (No Transaction) and 1 = Positive (Yes Transaction)
True Positives = 214 (the model correctly predicted customers who made the transaction)
True Negatives = 1977 (the model correctly predicted customers who did NOT make the transaction)
False Positives = 78 (the model predicted a YES transaction, but the customer did NOT make one)
False Negatives = 197 (the model predicted NO transaction, but the customer actually made one)
The high number of False Negatives is the main concern: the model over-predicts NO Transaction for customers who actually made a purchase.
cm_dt_reduced_dt = confusion_matrix(y_true=ytest,y_pred=yTestPredicted_dt_reduced_dt)
TP_dt_reduced_dt = cm_dt_reduced_dt[1,1]
TN_dt_reduced_dt = cm_dt_reduced_dt[0,0]
FP_dt_reduced_dt = cm_dt_reduced_dt[0,1]
FN_dt_reduced_dt = cm_dt_reduced_dt[1,0]
(TP_dt_reduced_dt + TN_dt_reduced_dt) / (TP_dt_reduced_dt + TN_dt_reduced_dt + FP_dt_reduced_dt + FN_dt_reduced_dt)
0.8884833738848338
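The remaining report metrics can be derived from the same four counts. A quick self-contained sketch, using the counts printed in the confusion matrix above as plain numbers:

```python
# Precision, recall and F1 for the YES class, derived from the
# confusion-matrix counts reported above (TP=214, TN=1977, FP=78, FN=197).
TP, TN, FP, FN = 214, 1977, 78, 197

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision_yes = TP / (TP + FP)   # of all predicted YES, how many were right
recall_yes = TP / (TP + FN)      # of all actual YES, how many were caught
f1_yes = 2 * precision_yes * recall_yes / (precision_yes + recall_yes)

print(round(accuracy, 4), round(precision_yes, 2),
      round(recall_yes, 2), round(f1_yes, 2))
# → 0.8885 0.73 0.52 0.61, matching the classification report below
```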
print(classification_report(y_true=ytest,y_pred=yTestPredicted_dt_reduced_dt))
precision recall f1-score support
0 0.91 0.96 0.93 2055
1 0.73 0.52 0.61 411
accuracy 0.89 2466
macro avg 0.82 0.74 0.77 2466
weighted avg 0.88 0.89 0.88 2466
Even the reduced DT is unable to classify YES Transactions with both high precision and high recall; this tree still needs improvement.
The recall for YES Transaction is low at 0.52, only slightly better than a 50-50 classifier.
There is still a significant gap between the precision for YES Transaction (0.73) and for NO Transaction (0.91).
The F1-score for YES Transaction is also quite low (0.61).
Overall, this means our model classifies NO Transactions very well but classifies YES Transactions poorly, i.e. it struggles to correctly identify customers who made a purchase on Clifford's website.
Let's complete the model evaluation for the reduced best-depth decision tree.
dt_reduced_dt.feature_importances_
array([0.00611547, 0.02887759, 0.05061364, 0.0604334 , 0.8222181 ,
0.00207366, 0.02966814])
pd.DataFrame({
"Feature":XtrainR.columns,
"Importance":dt_reduced_dt.feature_importances_
})
| | Feature | Importance |
|---|---|---|
| 0 | Informational | 0.006115 |
| 1 | ProductRelated | 0.028878 |
| 2 | ProductRelated_Duration | 0.050614 |
| 3 | ExitRates | 0.060433 |
| 4 | PageValues | 0.822218 |
| 5 | SpecialDay | 0.002074 |
| 6 | VisitorType Status | 0.029668 |
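Sorting the table makes the imbalance explicit. A small sketch using the importance values copied from the table above:

```python
import pandas as pd

# Importances copied from the table above; sorting shows that
# PageValues alone carries ~82% of the split decisions.
imp = pd.DataFrame({
    "Feature": ["Informational", "ProductRelated", "ProductRelated_Duration",
                "ExitRates", "PageValues", "SpecialDay", "VisitorType Status"],
    "Importance": [0.006115, 0.028878, 0.050614,
                   0.060433, 0.822218, 0.002074, 0.029668],
}).sort_values("Importance", ascending=False)
print(imp.iloc[0]["Feature"])  # → PageValues
```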
fpr_dt_reduced_dt,tpr_dt_reduced_dt, threshold_dt_reduced_dt = roc_curve(ytest,dt_reduced_dt.predict_proba(XtestR)[:,1])
plt.plot(fpr_dt_reduced_dt,tpr_dt_reduced_dt,label="FPR vs TPR")
plt.plot([0,1], [0,1], linestyle='--', label='No Skill')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for Reduced Best Depth Decision Tree")
plt.legend()
plt.show()
roc_auc_score(ytest,yTestPredicted_dt_reduced_dt)
0.7413625304136253
Reduced BEST Depth Decision Tree Model
Overall, this means our best-depth reduced decision tree classifies NO Transactions very well but still classifies YES Transactions poorly, i.e. it struggles to identify customers who made a purchase on Clifford's website.
| Comparison | Full DT | Reduced - DT |
|---|---|---|
| Model Accuracy | 88.19% | 88.88% |
| False Negatives | 208 | 197 |
| False Positives | 83 | 78 |
| Precision (NO) | 0.90 | 0.91 |
| Recall (NO) | 0.96 | 0.96 |
| F1-Score (NO) | 0.93 | 0.93 |
| Precision (YES) | 0.71 | 0.73 |
| Recall (YES) | 0.49 | 0.52 |
| F1-Score (YES) | 0.58 | 0.61 |
| AUC Score | 0.7261 | 0.7413 |
In a nutshell,
The full-scale and the reduced DT models behave quite similarly to one another, but the reduced DT is better than the full DT in terms of:
→ Higher accuracy
→ Fewer false-negative over-predictions
→ Higher Precision (NO)
→ Higher Recall (YES)
→ Higher F1-Score (YES)
→ Higher AUC
| Comparison | Reduced LogR | Reduced BEST-K KNN | Reduced BEST-Depth DT |
|---|---|---|---|
| Model Accuracy | 87.14% | 87.87% | 88.88% |
| False Negatives | 262 | 217 | 197 |
| False Positives | 55 | 82 | 78 |
| Precision (NO) | 0.88 | 0.90 | 0.91 |
| Recall (NO) | 0.97 | 0.96 | 0.96 |
| F1-Score (NO) | 0.93 | 0.93 | 0.93 |
| Precision (YES) | 0.73 | 0.70 | 0.73 |
| Recall (YES) | 0.36 | 0.47 | 0.52 |
| F1-Score (YES) | 0.48 | 0.56 | 0.61 |
| AUC Score | 0.667 | 0.716 | 0.7413 |
Overall, the reduced best-depth decision tree has quite an edge over the other predictive models.
It is evident that the reduced models, despite being simpler than the full models, can produce results at least as good as, and often better than, their full counterparts.
Still, the reduced models have a high number of false negatives and a correspondingly low recall, which is concerning.
This means the reduced models fail to predict most (though not all) YES Transactions properly, so their recall and precision for YES do not match those for NO Transaction.
They are still over-trained on NO Transactions, resulting in low recall for YES Transactions.
Pin 1.1 : Re-train the model with the best threshold
Impact & Resolution : Would accept more YES Transactions and boost the recall and F1-score for the YES Transaction class.
Pin 2 : Imbalance in the training data set
Impact : The model over-trains on NO Transactions and over-predicts them, while under-training on YES Transactions and under-predicting them, leading to low precision, recall and F1-score for YES Transactions.
Resolution : Train the model on a 50%-50% split of YES and NO Transaction records.
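The 50%-50% resolution above can be sketched with simple upsampling of the minority class. A minimal illustration on a synthetic frame, not the project's actual training split; `sklearn.utils.resample` is assumed available:

```python
import pandas as pd
from sklearn.utils import resample

# Synthetic imbalanced training frame: 15% YES, 85% NO.
train = pd.DataFrame({
    "PageValues": range(100),
    "Transaction": [1] * 15 + [0] * 85,
})
yes = train[train["Transaction"] == 1]
no = train[train["Transaction"] == 0]

# Upsample YES rows (with replacement) until both classes match.
yes_up = resample(yes, replace=True, n_samples=len(no), random_state=1)
balanced = pd.concat([no, yes_up])
print(balanced["Transaction"].value_counts().to_dict())  # → {0: 85, 1: 85}
```

Only the training split should be rebalanced this way; the test set must keep its natural class mix so the evaluation stays honest.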
print(classification_report(ytest, ytest_predicted_reduced_logR))
precision recall f1-score support
0 0.88 0.97 0.93 2055
1 0.72 0.35 0.47 411
accuracy 0.87 2466
macro avg 0.80 0.66 0.70 2466
weighted avg 0.86 0.87 0.85 2466
fpr_reduced_logR,tpr_reduced_logR, threshold_reduced_logR = roc_curve(ytest,reduced_logR.predict_proba(XtestR)[:,1])
import matplotlib.pyplot as plt
plt.plot(fpr_reduced_logR,tpr_reduced_logR,label="FPR vs TPR")
plt.plot([0,1], [0,1], linestyle='--', label='No Skill')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for Reduced Log Model")
plt.legend()
plt.show()
roc_auc_score(ytest,ytest_predicted_reduced_logR)
0.6605839416058393
gmeans_log = np.sqrt(tpr_reduced_logR * (1-fpr_reduced_logR))
ix_log = np.argmax(gmeans_log)
ix_log
284
print('Best Threshold=%f, G-Mean=%.3f' % (threshold_reduced_logR[ix_log], gmeans_log[ix_log]))
Best Threshold=0.122371, G-Mean=0.826
The best threshold found here is very low, around 0.12.
→ The reason for a low threshold is that we need to cut down the excessive false-negative predictions and capture as many true positives as possible while adding only a minimal number of false positives.
→ In layman's terms, lowering the threshold relaxes the bar for predicting YES Transaction, so more sessions are predicted as YES Transaction, at the cost of occasionally mis-identifying a NO Transaction as a YES Transaction.
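The G-mean selection used above can be seen on a tiny synthetic example: for every candidate threshold from the ROC curve, compute sqrt(TPR × (1 − FPR)) and keep the maximizer (illustrative labels and scores, not the project data):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy labels and predicted probabilities for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 1])
scores = np.array([0.05, 0.1, 0.2, 0.1, 0.3, 0.15, 0.35, 0.8, 0.4, 0.5])

# drop_intermediate=False keeps every candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, scores, drop_intermediate=False)
gmeans = np.sqrt(tpr * (1 - fpr))   # rewards high TPR AND low FPR together
best = np.argmax(gmeans)
print(thresholds[best])  # → 0.35, well below the 0.5 default
```

Because the minority class has few high scores, the threshold that balances TPR against FPR naturally lands below 0.5, which is exactly what happens with our models.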
ytrainPredicted_threshold_reduced_logR = (reduced_logR.predict_proba(XtrainR)[:, 1] > threshold_reduced_logR[ix_log]).astype('float')
ytestPredicted_threshold_reduced_logR = (reduced_logR.predict_proba(XtestR)[:, 1] > threshold_reduced_logR[ix_log]).astype('float')
accuracy_score(y_true=ytest,y_pred=ytestPredicted_threshold_reduced_logR)*100
83.29278183292782
confusion_matrix(y_true=ytest,y_pred=ytestPredicted_threshold_reduced_logR)
array([[1720, 335],
[ 77, 334]], dtype=int64)
sns.heatmap(confusion_matrix(y_true=ytest,y_pred=ytestPredicted_threshold_reduced_logR),annot=True,fmt='g')
plt.xlabel("Predictions")
plt.ylabel("Actual Values")
plt.title("Confusion Matrix for Reduced LogR Model On Best Threshold")
plt.show()
So here,
0 = Negative (No Transaction) and 1 = Positive (Yes Transaction)
True Positives = 317 (customers correctly predicted to have made the transaction)
True Negatives = 1775 (customers correctly predicted to have NOT made the transaction)
False Positives = 280 (customers predicted as YES Transaction who actually did NOT make the transaction)
False Negatives = 94 (customers predicted as NO Transaction who actually DID make the transaction)
cm_threshold_reduced_logR = confusion_matrix(y_true=ytest,y_pred=ytestPredicted_threshold_reduced_logR)
TP_threshold_reduced_logR = cm_threshold_reduced_logR[1,1]
TN_threshold_reduced_logR = cm_threshold_reduced_logR[0,0]
FP_threshold_reduced_logR = cm_threshold_reduced_logR[0,1]
FN_threshold_reduced_logR = cm_threshold_reduced_logR[1,0]
(TP_threshold_reduced_logR + TN_threshold_reduced_logR )/ (TP_threshold_reduced_logR + TN_threshold_reduced_logR +FP_threshold_reduced_logR +FN_threshold_reduced_logR)
0.8329278183292782
print(classification_report(ytest,ytestPredicted_threshold_reduced_logR))
precision recall f1-score support
0 0.96 0.84 0.89 2055
1 0.50 0.81 0.62 411
accuracy 0.83 2466
macro avg 0.73 0.82 0.76 2466
weighted avg 0.88 0.83 0.85 2466
The recall for YES Transaction here has improved from 0.36 to 0.77.
There is still a significant gap between the precision for YES Transaction (0.53) and NO Transaction (0.95).
The F1-score for YES Transaction has improved from 0.48 to 0.63.
Overall, this is about the best recall and F1-score that can be achieved with these parameters for the LogR model.
Let's complete the model evaluation for the best-threshold reduced LogR model.
fpr_threshold_reduced_logR,tpr_threshold_reduced_logR, threshold_threshold_reduced_logR= roc_curve(ytest,reduced_logR.predict_proba(XtestR)[:,1])
import matplotlib.pyplot as plt
plt.plot(fpr_threshold_reduced_logR,tpr_threshold_reduced_logR,label="Best FPR vs TPR")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for Reduced Best Threshold LogR Model")
plt.scatter(fpr_threshold_reduced_logR[ix_log], tpr_threshold_reduced_logR[ix_log], marker='o', color='black', label='Best')
plt.plot([0,1], [0,1], linestyle='--', label='No Skill')
plt.legend()
plt.show()
roc_auc_score(ytest,ytestPredicted_threshold_reduced_logR)
0.8248175182481752
BEST Threshold Reduced Log Reg Summary
Overall, this means our reduced model now classifies YES Transactions very well, at the cost of incorrectly predicting some NO Transactions as YES Transactions.
| Comparison | Full LogR | Reduced - LogR | Best Threshold - Reduced LogR |
|---|---|---|---|
| Model Accuracy | 87.18% | 87.14% | 84.83% |
| False Negatives | 265 | 262 | 94 |
| False Positives | 51 | 55 | 280 |
| Precision (NO) | 0.88 | 0.88 | 0.95 |
| Recall (NO) | 0.98 | 0.97 | 0.86 |
| F1-Score (NO) | 0.93 | 0.93 | 0.90 |
| Precision (YES) | 0.74 | 0.73 | 0.53 |
| Recall (YES) | 0.36 | 0.36 | 0.77 |
| F1-Score (YES) | 0.48 | 0.48 | 0.63 |
| AUC Score | 0.665 | 0.667 | 0.8175 |
Reduced Model with Best Threshold Evaluation
There is quite an improvement in recall and F1-score for YES Transaction, but the improvement comes at a cost.
As the threshold of ~0.14 is very low, the LogR model now flags a much larger share of sessions as YES Transaction. It is often right, but it also makes far more mistakes in the other direction, predicting NO Transaction customers as YES Transaction.
This is damaging for the business: due to this misclassification, Clifford may skip reaching out to customers who were on the edge of making a transaction but did not, because the model has already labelled them as YES Transaction when in reality they made no purchase.
print(classification_report(ytest,ytestPredicted_reduced_knn_final))
precision recall f1-score support
0 0.90 0.96 0.93 2055
1 0.70 0.47 0.56 411
accuracy 0.88 2466
macro avg 0.80 0.72 0.75 2466
weighted avg 0.87 0.88 0.87 2466
fpr_reduced_knn_final,tpr_reduced_knn_final, threshold_reduced_knn_final = roc_curve(ytest,reduced_knn_final.predict_proba(XtestScaledR)[:,1])
import matplotlib.pyplot as plt
plt.plot(fpr_reduced_knn_final,tpr_reduced_knn_final,label="FPR vs TPR")
plt.plot([0,1], [0,1], linestyle='--', label='No Skill')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for Reduced BEST-K KNN Model")
plt.legend()
plt.show()
roc_auc_score(ytest,ytestPredicted_reduced_knn_final)
0.7160583941605839
gmeans_knn = np.sqrt(tpr_reduced_knn_final * (1-fpr_reduced_knn_final))
ix_knn = np.argmax(gmeans_knn)
ix_knn
10
print('Best Threshold=%f, G-Mean=%.3f' % (threshold_reduced_knn_final[ix_knn], gmeans_knn[ix_knn]))
Best Threshold=0.181818, G-Mean=0.811
The best threshold here is again low, at about 0.1818.
→ The reason for a low threshold is that we need to cut down the excessive false-negative predictions and capture as many true positives as possible while adding only a minimal number of false positives.
→ In layman's terms, lowering the threshold relaxes the bar for predicting YES Transaction, so more sessions are predicted as YES Transaction, at the cost of occasionally mis-identifying a NO Transaction as a YES Transaction.
ytrainPredicted_threshold_reduced_knn = (reduced_knn_final.predict_proba(XtrainScaledR)[:, 1] > threshold_reduced_knn_final[ix_knn]).astype('float')
ytestPredicted_threshold_reduced_knn = (reduced_knn_final.predict_proba(XtestScaledR)[:, 1] >threshold_reduced_knn_final[ix_knn]).astype('float')
accuracy_score(y_true=ytest,y_pred=ytestPredicted_threshold_reduced_knn)*100
85.56366585563666
confusion_matrix(y_true=ytest,y_pred=ytestPredicted_threshold_reduced_knn)
array([[1817, 238],
[ 118, 293]], dtype=int64)
sns.heatmap(confusion_matrix(y_true=ytest,y_pred=ytestPredicted_threshold_reduced_knn),annot=True,fmt='g')
plt.xlabel("Predictions")
plt.ylabel("Actual Values")
plt.title("Confusion Matrix for Reduced KNN Model On Best Threshold")
plt.show()
So here,
0 = Negative (No Transaction) and 1 = Positive (Yes Transaction)
True Positives = 293 (customers correctly predicted to have made the transaction)
True Negatives = 1817 (customers correctly predicted to have NOT made the transaction)
False Positives = 238 (customers predicted as YES Transaction who actually did NOT make the transaction)
False Negatives = 118 (customers predicted as NO Transaction who actually DID make the transaction)
cm_threshold_reduced_knn = confusion_matrix(y_true=ytest,y_pred=ytestPredicted_threshold_reduced_knn)
TP_threshold_reduced_knn = cm_threshold_reduced_knn[1,1]
TN_threshold_reduced_knn = cm_threshold_reduced_knn[0,0]
FP_threshold_reduced_knn = cm_threshold_reduced_knn[0,1]
FN_threshold_reduced_knn = cm_threshold_reduced_knn[1,0]
(TP_threshold_reduced_knn + TN_threshold_reduced_knn )/ (TP_threshold_reduced_knn + TN_threshold_reduced_knn +FP_threshold_reduced_knn +FN_threshold_reduced_knn)
0.8556366585563666
print(classification_report(ytest,ytestPredicted_threshold_reduced_knn))
precision recall f1-score support
0 0.94 0.88 0.91 2055
1 0.55 0.71 0.62 411
accuracy 0.86 2466
macro avg 0.75 0.80 0.77 2466
weighted avg 0.87 0.86 0.86 2466
The recall for YES Transaction here has improved from 0.47 to 0.71.
There is still a significant gap between the precision for YES Transaction (0.55) and NO Transaction (0.94).
The F1-score for YES Transaction has improved from 0.56 to 0.62.
Overall, this is about the best recall and F1-score that can be achieved with these parameters for the KNN model.
Let's complete the model evaluation for the best-threshold reduced KNN model.
fpr_threshold_reduced_knn,tpr_threshold_reduced_knn, threshold_threshold_reduced_knn= roc_curve(ytest,reduced_knn_final.predict_proba(XtestScaledR)[:,1])
import matplotlib.pyplot as plt
plt.plot(fpr_threshold_reduced_knn,tpr_threshold_reduced_knn,label="Best FPR vs TPR")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for Reduced Best Threshold KNN Model")
plt.scatter(fpr_threshold_reduced_knn[ix_knn], tpr_threshold_reduced_knn[ix_knn], marker='o', color='black', label='Best')
plt.plot([0,1], [0,1], linestyle='--', label='No Skill')
plt.legend()
plt.show()
roc_auc_score(ytest,ytestPredicted_threshold_reduced_knn)
0.7985401459854015
BEST Threshold Reduced KNN Summary
| Comparison | Full KNN | Reduced - KNN | Best Threshold - Reduced KNN |
|---|---|---|---|
| Model Accuracy | 87.87% | 87.87% | 85.56% |
| False Negatives | 253 | 217 | 118 |
| False Positives | 46 | 82 | 238 |
| Precision (NO) | 0.89 | 0.90 | 0.94 |
| Recall (NO) | 0.98 | 0.96 | 0.88 |
| F1-Score (NO) | 0.93 | 0.93 | 0.91 |
| Precision (YES) | 0.74 | 0.73 | 0.55 |
| Recall (YES) | 0.38 | 0.47 | 0.71 |
| F1-Score (YES) | 0.51 | 0.56 | 0.62 |
| AUC Score | 0.681 | 0.716 | 0.7985 |
Reduced Model with Best Threshold Evaluation
There is quite an improvement in recall and F1-score for YES Transaction, but again at a cost.
As the threshold of ~0.18 is very low, the KNN model now flags a much larger share of sessions as YES Transaction. It is often right, but it also makes far more mistakes in the other direction, predicting NO Transaction customers as YES Transaction.
This is damaging for the business: due to this misclassification, Clifford may skip reaching out to customers who were on the edge of making a transaction but did not, because the model has already labelled them as YES Transaction when in reality they made no purchase.
print(classification_report(y_true=ytest,y_pred=yTestPredicted_dt_reduced_dt))
precision recall f1-score support
0 0.91 0.96 0.93 2055
1 0.73 0.52 0.61 411
accuracy 0.89 2466
macro avg 0.82 0.74 0.77 2466
weighted avg 0.88 0.89 0.88 2466
fpr_dt_reduced_dt,tpr_dt_reduced_dt, threshold_dt_reduced_dt = roc_curve(ytest,dt_reduced_dt.predict_proba(XtestR)[:,1])
plt.plot(fpr_dt_reduced_dt,tpr_dt_reduced_dt,label="FPR vs TPR")
plt.plot([0,1], [0,1], linestyle='--', label='No Skill')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for Reduced Best Depth Decision Tree")
plt.legend()
plt.show()
roc_auc_score(ytest,yTestPredicted_dt_reduced_dt)
0.7413625304136253
gmeans_dt = np.sqrt(tpr_dt_reduced_dt * (1-fpr_dt_reduced_dt))
ix_dt = np.argmax(gmeans_dt)
ix_dt
32
print('Best Threshold=%f, G-Mean=%.3f' % (threshold_dt_reduced_dt[ix_dt], gmeans_dt[ix_dt]))
Best Threshold=0.111111, G-Mean=0.843
The best threshold here is again very low, at about 0.1111.
→ The reason for a low threshold is that we need to cut down the excessive false-negative predictions and capture as many true positives as possible while adding only a minimal number of false positives.
→ In layman's terms, lowering the threshold relaxes the bar for predicting YES Transaction, so more sessions are predicted as YES Transaction, at the cost of occasionally mis-identifying a NO Transaction as a YES Transaction.
ytrainPredicted_threshold_reduced_dt = (dt_reduced_dt.predict_proba(XtrainR)[:, 1] > threshold_dt_reduced_dt[ix_dt]).astype('float')
ytestPredicted_threshold_reduced_dt = (dt_reduced_dt.predict_proba(XtestR)[:, 1] >threshold_dt_reduced_dt[ix_dt]).astype('float')
accuracy_score(y_true=ytest,y_pred=ytestPredicted_threshold_reduced_dt)*100
83.4955393349554
confusion_matrix(y_true=ytest,y_pred=ytestPredicted_threshold_reduced_dt)
array([[1713, 342],
[ 65, 346]], dtype=int64)
sns.heatmap(confusion_matrix(y_true=ytest,y_pred=ytestPredicted_threshold_reduced_dt),annot=True,fmt='g')
plt.xlabel("Predictions")
plt.ylabel("Actual Values")
plt.title("Confusion Matrix for Reduced DT Model On Best Threshold")
plt.show()
So here,
0 = Negative (No Transaction) and 1 = Positive (Yes Transaction)
True Positives = 346 (customers correctly predicted to have made the transaction)
True Negatives = 1713 (customers correctly predicted to have NOT made the transaction)
False Positives = 342 (customers predicted as YES Transaction who actually did NOT make the transaction)
False Negatives = 65 (customers predicted as NO Transaction who actually DID make the transaction)
cm_threshold_reduced_dt = confusion_matrix(y_true=ytest,y_pred=ytestPredicted_threshold_reduced_dt)
TP_threshold_reduced_dt = cm_threshold_reduced_dt[1,1]
TN_threshold_reduced_dt = cm_threshold_reduced_dt[0,0]
FP_threshold_reduced_dt = cm_threshold_reduced_dt[0,1]
FN_threshold_reduced_dt = cm_threshold_reduced_dt[1,0]
(TP_threshold_reduced_dt + TN_threshold_reduced_dt )/ (TP_threshold_reduced_dt + TN_threshold_reduced_dt +FP_threshold_reduced_dt +FN_threshold_reduced_dt)
0.8349553933495539
print(classification_report(ytest,ytestPredicted_threshold_reduced_dt))
precision recall f1-score support
0 0.96 0.83 0.89 2055
1 0.50 0.84 0.63 411
accuracy 0.83 2466
macro avg 0.73 0.84 0.76 2466
weighted avg 0.89 0.83 0.85 2466
The recall for YES Transaction here has improved from 0.52 to 0.84.
There is still a significant gap between the precision for YES Transaction (0.50) and NO Transaction (0.96).
The F1-score for YES Transaction has improved from 0.61 to 0.63.
Overall, this is about the best recall and F1-score that can be achieved with these parameters for the decision tree model.
Let's complete the model evaluation for the best-threshold reduced decision tree model.
fpr_threshold_reduced_dt,tpr_threshold_reduced_dt, threshold_threshold_reduced_dt= roc_curve(ytest,dt_reduced_dt.predict_proba(XtestR)[:,1])
import matplotlib.pyplot as plt
plt.plot(fpr_threshold_reduced_dt,tpr_threshold_reduced_dt,label="Best FPR vs TPR")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for Reduced Best Threshold Decision Tree Model")
plt.scatter(fpr_threshold_reduced_dt[ix_dt], tpr_threshold_reduced_dt[ix_dt], marker='o', color='black', label='Best')
plt.plot([0,1], [0,1], linestyle='--', label='No Skill')
plt.legend()
plt.show()
roc_auc_score(ytest,ytestPredicted_threshold_reduced_dt)
0.837712895377129
BEST Threshold Reduced Decision Tree
Overall, this means our reduced model now classifies YES Transactions very well, at the cost of incorrectly predicting some NO Transactions as YES Transactions.
| Comparison | Full DT | Reduced - DT | Best Threshold - Reduced DT |
|---|---|---|---|
| Model Accuracy | 88.19% | 88.88% | 83.49% |
| False Negatives | 208 | 197 | 65 |
| False Positives | 83 | 78 | 342 |
| Precision (NO) | 0.90 | 0.91 | 0.96 |
| Recall (NO) | 0.96 | 0.96 | 0.83 |
| F1-Score (NO) | 0.93 | 0.93 | 0.89 |
| Precision (YES) | 0.71 | 0.73 | 0.50 |
| Recall (YES) | 0.49 | 0.52 | 0.84 |
| F1-Score (YES) | 0.58 | 0.61 | 0.63 |
| AUC Score | 0.7261 | 0.7413 | 0.8377 |
Reduced Model with Best Threshold Evaluation
There is quite an improvement in recall and F1-score for YES Transaction, but once more at a cost.
As the threshold of ~0.11 is very low, the decision tree model now flags a much larger share of sessions as YES Transaction. It is often right, but it also makes far more mistakes in the other direction, predicting NO Transaction customers as YES Transaction.
This is damaging for the business: due to this misclassification, Clifford may skip reaching out to customers who were on the edge of making a transaction but did not, because the model has already labelled them as YES Transaction when in reality they made no purchase.
| Comparison | Best Threshold - Reduced LogR Model | Best Threshold - Reduced KNN | Best Threshold - Reduced DT |
|---|---|---|---|
| Model Accuracy | 84.83% | 85.56% | 83.49% |
| False Negatives | 94 | 118 | 65 |
| False Positives | 280 | 238 | 342 |
| Precision (NO) | 0.95 | 0.94 | 0.96 |
| Recall (NO) | 0.86 | 0.88 | 0.83 |
| F1-Score (NO) | 0.90 | 0.91 | 0.89 |
| Precision (YES) | 0.53 | 0.55 | 0.50 |
| Recall (YES) | 0.77 | 0.71 | 0.84 |
| F1-Score (YES) | 0.63 | 0.62 | 0.63 |
| AUC Score | 0.8175 | 0.7985 | 0.8377 |
There is cut-throat competition between the best-threshold KNN and the best-threshold reduced DT.
DT pros : the highest Recall (YES) at 0.84, the fewest false negatives (65) and the highest AUC (0.8377).
Pros of the best threshold : it maximises recall for YES Transaction, so far fewer potential buyers are missed.
Cons of the best threshold : precision for YES Transaction drops sharply, so many NO Transaction customers get flagged as YES.
Conclusively, the "best" threshold is optimal only in a theoretical (G-mean) sense; for actual use, a business-acceptable threshold is needed.
Since these thresholds are theoretical, we need to find a threshold balance that the business can accept:
Best threshold (theoretical) → Practical threshold (??) → Default threshold (0.5)
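One way to make the "practical threshold" concrete is to sweep from the theoretical optimum up toward 0.5 and keep the lowest threshold whose YES-precision clears a floor the business sets. A sketch under assumptions: synthetic scores stand in for the model probabilities, and the 0.65 floor and [0.12, 0.5] range are illustrative, not values from the report.

```python
import numpy as np
from sklearn.metrics import precision_score

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 500)
# Synthetic probabilities: positives score higher on average.
scores = np.clip(y_true * 0.4 + rng.normal(0.3, 0.2, 500), 0, 1)

def pick_business_threshold(y, p, lo=0.12, hi=0.5, floor=0.65):
    """Lowest threshold in [lo, hi) whose YES-precision meets the floor."""
    for t in np.arange(lo, hi, 0.01):
        preds = (p > t).astype(int)
        if precision_score(y, preds, zero_division=0) >= floor:
            return round(float(t), 2)
    return hi  # fall back to the default threshold

print(pick_business_threshold(y_true, scores))
```

The floor is the business lever: raising it trades recall for precision, moving the chosen threshold back toward the 0.5 default.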
print('Best Threshold for LogR=%f, G-Mean=%.3f' % (threshold_reduced_logR[ix_log], gmeans_log[ix_log]))
Best Threshold for LogR=0.122371, G-Mean=0.826
acc_test = []
thre = []
acc_train = []
best_log_threshold = threshold_reduced_logR[ix_log]
start_index = np.where(threshold_reduced_logR <= 0.5)[0][0]
end_index = np.where(threshold_reduced_logR == best_log_threshold)[0][0]
for i in range(start_index,end_index):
    ytestPredicted_finalR_ThresholdH = (reduced_logR.predict_proba(XtestR)[:, 1] > threshold_reduced_logR[i]).astype('float')
    ytrainPredicted_finalR_ThresholdH = (reduced_logR.predict_proba(XtrainR)[:, 1] > threshold_reduced_logR[i]).astype('float')
    acc_test.append(accuracy_score(ytest,ytestPredicted_finalR_ThresholdH))
    thre.append(threshold_reduced_logR[i])
    acc_train.append(accuracy_score(ytrain,ytrainPredicted_finalR_ThresholdH))
fig = px.line(x = thre,y=[acc_train,acc_test])
fig["data"][0]["name"] = "Accuracy Train"
fig["data"][1]["name"] = "Accuracy Test"
fig.show()
The best practical threshold for the LogR model appears to be 0.22084, where the model has the best combination of train accuracy () and the highest test accuracy within the defined range of thresholds.
ytestPredicted_logR_BT = (reduced_logR.predict_proba(XtestR)[:, 1] > threshold_reduced_logR[np.where(threshold_reduced_logR>=0.22084)[0][-1]]).astype('float')
ytrainPredicted__logR_BT = (reduced_logR.predict_proba(XtrainR)[:, 1] > threshold_reduced_logR[np.where(threshold_reduced_logR>=0.22084)[0][-1]]).astype('float')
threshold_reduced_logR[np.where(threshold_reduced_logR>=0.22084)[0][-1]]
0.22168234482445176
accuracy_score(y_true=ytest,y_pred=ytestPredicted_logR_BT)*100
88.32116788321169
confusion_matrix(y_true=ytest,y_pred=ytestPredicted_logR_BT)
array([[1933, 122],
[ 166, 245]], dtype=int64)
sns.heatmap(confusion_matrix(y_true=ytest,y_pred=ytestPredicted_logR_BT),annot=True,fmt='g')
plt.xlabel("Predictions")
plt.ylabel("Actual Values")
plt.title("Confusion Matrix for Reduced LogR Model On Business Threshold")
plt.show()
So here,
0 = Negative (No Transaction) and 1 = Positive (Yes Transaction)
True Positives = 247 (customers correctly predicted to have made the transaction)
True Negatives = 1932 (customers correctly predicted to have NOT made the transaction)
False Positives = 123 (customers predicted as YES Transaction who actually did NOT make the transaction)
False Negatives = 164 (customers predicted as NO Transaction who actually DID make the transaction)
cm_logR_BT = confusion_matrix(y_true=ytest,y_pred=ytestPredicted_logR_BT)
TP_logR_BT = cm_logR_BT[1,1]
TN_logR_BT = cm_logR_BT[0,0]
FP_logR_BT = cm_logR_BT[0,1]
FN_logR_BT = cm_logR_BT[1,0]
(TP_logR_BT + TN_logR_BT )/ (TP_logR_BT + TN_logR_BT +FP_logR_BT +FN_logR_BT)
0.8832116788321168
print(classification_report(ytest,ytestPredicted_logR_BT))
precision recall f1-score support
0 0.92 0.94 0.93 2055
1 0.67 0.60 0.63 411
accuracy 0.88 2466
macro avg 0.79 0.77 0.78 2466
weighted avg 0.88 0.88 0.88 2466
The recall for YES Transaction has settled at 0.60.
The F1-score for YES Transaction has held at 0.63.
Overall, this is a balanced recall and F1-score for the LogR model under a business-acceptable threshold.
Let's complete the model evaluation for the business-threshold reduced LogR model.
fpr_threshold_reduced_logR,tpr_threshold_reduced_logR, threshold_threshold_reduced_logR= roc_curve(ytest,reduced_logR.predict_proba(XtestR)[:,1])
import matplotlib.pyplot as plt
plt.plot(fpr_threshold_reduced_logR,tpr_threshold_reduced_logR,label="Best FPR vs TPR")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for Reduced Best Threshold LogR Model")
plt.scatter(fpr_threshold_reduced_logR[ix_log], tpr_threshold_reduced_logR[ix_log], marker='o', color='black', label='Best Threshold')
plt.scatter(fpr_threshold_reduced_logR[np.where(threshold_reduced_logR>=0.22084)[0][-1]], tpr_threshold_reduced_logR[np.where(threshold_reduced_logR>=0.22084)[0][-1]], marker='o', color='red', label='Business Threshold')
plt.plot([0,1], [0,1], linestyle='--', label='No Skill')
plt.legend()
plt.show()
roc_auc_score(ytest,ytestPredicted_logR_BT)
0.7683698296836984
Business Threshold Reduced Log Reg Summary
Overall, this means our reduced model now classifies YES and NO Transactions in a balanced way, at a minimal cost of incorrectly predicting some NO Transactions as YES Transactions.
| Comparison | Full LogR | Reduced - LogR | Best Threshold - Reduced LogR | Business Threshold - Reduced LogR |
|---|---|---|---|---|
| Model Accuracy | 87.18% | 87.14% | 84.83% | 88.36% |
| False Negatives | 265 | 262 | 94 | 164 - Moderate |
| False Positives | 51 | 55 | 280 | 123 - Moderate |
| Precision (NO) | 0.88 | 0.88 | 0.95 | In Control - 0.92 |
| Recall (NO) | 0.98 | 0.97 | 0.86 | In Control - 0.94 |
| F1-Score (NO) | 0.93 | 0.93 | 0.90 | 0.93 |
| Precision (YES) | 0.74 | 0.73 | 0.53 | Balanced - 0.63 |
| Recall (YES) | 0.36 | 0.36 | 0.77 | Balanced - 0.60 |
| F1-Score (YES) | 0.48 | 0.48 | 0.63 | 0.63 |
| AUC Score | 0.665 | 0.667 | 0.8175 | Balanced - 0.7705 |
As the threshold of 0.22084 is still on the low side, the model flags more sessions as YES Transaction than the default threshold would, and it is often right.
The model still makes mistakes, but at a controlled, moderate rate: it predicts some NO Transaction customers as YES Transaction.
This model now mis-classifies roughly equally in both directions:
• YES Transaction → NO Transaction, and
• NO Transaction → YES Transaction
print('Best Threshold for KNN=%f, G-Mean=%.3f' % (threshold_threshold_reduced_knn[ix_knn], gmeans_knn[ix_knn]))
Best Threshold for KNN=0.181818, G-Mean=0.811
acc_test = []
thre = []
acc_train = []
best_knn_threshold = threshold_threshold_reduced_knn[ix_knn]
start_index = np.where(threshold_threshold_reduced_knn <= 0.5)[0][0]
end_index = np.where(threshold_threshold_reduced_knn == best_knn_threshold)[0][0]+1
for i in range(start_index,end_index):
    ytestPredicted_finalR_ThresholdH = (reduced_knn_final.predict_proba(XtestScaledR)[:, 1] > threshold_threshold_reduced_knn[i]).astype('float')
    ytrainPredicted_finalR_ThresholdH = (reduced_knn_final.predict_proba(XtrainScaledR)[:, 1] > threshold_threshold_reduced_knn[i]).astype('float')
    acc_test.append(accuracy_score(ytest,ytestPredicted_finalR_ThresholdH))
    thre.append(threshold_threshold_reduced_knn[i])
    acc_train.append(accuracy_score(ytrain,ytrainPredicted_finalR_ThresholdH))
fig = px.line(x = thre,y=[acc_train,acc_test])
fig["data"][0]["name"] = "Accuracy Train"
fig["data"][1]["name"] = "Accuracy Test"
fig.show()
The best practical threshold for the KNN model appears to be 0.3636, where the model has the best combination of train accuracy () and the highest test accuracy within the defined range of thresholds.
ytestPredicted_knn_BT = (reduced_knn_final.predict_proba(XtestScaledR)[:, 1] > threshold_threshold_reduced_knn[np.where(threshold_threshold_reduced_knn>=0.3636)[0][-1]]).astype('float')
ytrainPredicted__knn_BT = (reduced_knn_final.predict_proba(XtrainScaledR)[:, 1] > threshold_threshold_reduced_knn[np.where(threshold_threshold_reduced_knn>=0.3636)[0][-1]]).astype('float')
threshold_threshold_reduced_knn[np.where(threshold_threshold_reduced_knn>=0.3636)[0][-1]]
0.36363636363636365
accuracy_score(y_true=ytest,y_pred=ytestPredicted_knn_BT)*100
87.95620437956204
confusion_matrix(y_true=ytest,y_pred=ytestPredicted_knn_BT)
array([[1943, 112],
[ 185, 226]], dtype=int64)
sns.heatmap(confusion_matrix(y_true=ytest,y_pred=ytestPredicted_knn_BT),annot=True,fmt='g')
plt.xlabel("Predictions")
plt.ylabel("Actual Values")
plt.title("Confusion Matrix for Reduced KNN Model On Business Threshold")
plt.show()
So here,
0 = Negative (No Transaction) and 1 = Positive (Yes Transaction)
True Positives = 226 (customers correctly predicted to have made the transaction)
True Negatives = 1943 (customers correctly predicted not to have made the transaction)
False Positives = 112 (predicted YES Transaction, but the customer did not make the transaction)
False Negatives = 185 (predicted NO Transaction, but the customer actually made the transaction)
TP_knn_BT = confusion_matrix(y_true=ytest,y_pred=ytestPredicted_knn_BT)[1,1]
TN_knn_BT = confusion_matrix(y_true=ytest,y_pred=ytestPredicted_knn_BT)[0,0]
FP_knn_BT = confusion_matrix(y_true=ytest,y_pred=ytestPredicted_knn_BT)[0,1]
FN_knn_BT = confusion_matrix(y_true=ytest,y_pred=ytestPredicted_knn_BT)[1,0]
(TP_knn_BT + TN_knn_BT )/ (TP_knn_BT + TN_knn_BT +FP_knn_BT +FN_knn_BT)
0.8795620437956204
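In scikit-learn's convention the confusion matrix has rows for the actual class and columns for the predicted class, so for binary labels the four cells can be unpacked directly with `ravel()`. A toy illustration (labels invented for the example):

```python
# scikit-learn's confusion_matrix returns rows = actual class, columns =
# predicted class, so for binary labels the cells unpack as TN, FP, FN, TP.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1]   # toy labels for illustration only
y_pred = [0, 1, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)          # 2 1 1 2
accuracy = (tp + tn) / (tp + tn + fp + fn)
```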
print(classification_report(ytest,ytestPredicted_knn_BT))
precision recall f1-score support
0 0.91 0.95 0.93 2055
1 0.67 0.55 0.60 411
accuracy 0.88 2466
macro avg 0.79 0.75 0.77 2466
weighted avg 0.87 0.88 0.87 2466
The recall for YES Transaction has settled at 0.55.
The F1-score for YES Transaction has held at 0.60.
Overall, this is the best recall and F1-score achievable with these parameters for the KNN model.
Let's complete the model evaluation for the business-threshold reduced KNN model.
fpr_threshold_reduced_knn[np.where(threshold_threshold_reduced_knn>=0.3636)[0][-1]]
0.08029197080291971
fpr_threshold_reduced_knn,tpr_threshold_reduced_knn, threshold_threshold_reduced_knn= roc_curve(ytest,reduced_knn_final.predict_proba(XtestScaledR)[:,1])
import matplotlib.pyplot as plt
plt.plot(fpr_threshold_reduced_knn,tpr_threshold_reduced_knn,label="Best FPR vs TPR")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for Reduced Best Threshold KNN Model")
plt.scatter(fpr_threshold_reduced_knn[ix_knn], tpr_threshold_reduced_knn[ix_knn], marker='o', color='black', label='Best')
plt.scatter(fpr_threshold_reduced_knn[np.where(threshold_threshold_reduced_knn>=0.3636)[0][-1]], tpr_threshold_reduced_knn[np.where(threshold_threshold_reduced_knn>=0.3636)[0][-1]], marker='o', color='red', label='Business Threshold')
plt.plot([0,1], [0,1], linestyle='--', label='No Skill')
plt.legend()
plt.show()
roc_auc_score(ytest,ytestPredicted_knn_BT)
0.7476885644768856
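A caveat on the AUC figure above: passing hard 0/1 predictions to `roc_auc_score` collapses the metric to (TPR + TNR) / 2 at that single cut-off, i.e. the balanced accuracy; the threshold-free AUC needs the predicted probabilities instead. A small sketch on synthetic data showing the difference (generic names, not the report's variables):

```python
# roc_auc_score on hard 0/1 predictions equals the balanced accuracy at that
# single threshold; the full-curve AUC requires predict_proba scores.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, balanced_accuracy_score

X, y = make_classification(n_samples=400, weights=[0.8, 0.2], random_state=1)
clf = LogisticRegression(max_iter=1000).fit(X, y)

auc_from_labels = roc_auc_score(y, clf.predict(X))              # single-threshold
auc_from_scores = roc_auc_score(y, clf.predict_proba(X)[:, 1])  # full ROC curve
print(auc_from_labels, balanced_accuracy_score(y, clf.predict(X)), auc_from_scores)
```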
Business Threshold Reduced KNN Model
Overall, this means our reduced model classifies YES and NO Transactions in a balanced way, at a minimal cost of incorrectly predicting NO Transaction customers as YES Transaction.
| Comparison | Full KNN | Reduced KNN | Best Threshold - Reduced KNN | Business Threshold - Reduced KNN |
|---|---|---|---|---|
| Model Accuracy | 87.87% | 87.87% | 85.56% | 87.95% |
| False Negatives | 253 | 217 | 118 | 185 (moderate) |
| False Positives | 46 | 82 | 238 | 112 (moderate) |
| Precision (NO) | 0.89 | 0.90 | 0.94 | 0.91 (in control) |
| Recall (NO) | 0.98 | 0.96 | 0.88 | 0.95 (in control) |
| F1-Score (NO) | 0.93 | 0.93 | 0.91 | 0.93 |
| Precision (YES) | 0.74 | 0.73 | 0.55 | 0.67 (balanced) |
| Recall (YES) | 0.38 | 0.47 | 0.71 | 0.55 (balanced) |
| F1-Score (YES) | 0.51 | 0.56 | 0.62 | 0.60 (balanced) |
| AUC Score | 0.681 | 0.716 | 0.7985 | 0.7476 (balanced) |
As the threshold of 0.3636 is acceptable:
Of the clients it flags as YES Transaction, the KNN model is correct for roughly 7 out of every 11 (precision ≈ 0.67).
The model still makes moderate mistakes: it predicts NO Transaction customers as YES Transaction at a controllable rate.
It mis-classifies in both directions at a comparable rate:
• YES Transaction → NO Transaction, and
• NO Transaction → YES Transaction
print('Best Threshold for Decision Trees = %f, G-Mean=%.3f' % (threshold_dt_reduced_dt[ix_dt], gmeans_dt[ix_dt]))
Best Threshold for Decision Trees = 0.111111, G-Mean=0.843
acc_test = []
thre = []
acc_train = []
best_dt_threshold = threshold_dt_reduced_dt[ix_dt]
start_index = np.where(threshold_dt_reduced_dt <= 0.5)[0][0]
end_index = np.where(threshold_dt_reduced_dt == best_dt_threshold)[0][0]+1
for i in range(start_index,end_index):
    ytestPredicted_finalR_ThresholdH = (dt_reduced_dt.predict_proba(XtestR)[:, 1] > threshold_dt_reduced_dt[i]).astype('float')
    ytrainPredicted_finalR_ThresholdH = (dt_reduced_dt.predict_proba(XtrainR)[:, 1] > threshold_dt_reduced_dt[i]).astype('float')
    acc_test.append(accuracy_score(ytest,ytestPredicted_finalR_ThresholdH))
    thre.append(threshold_dt_reduced_dt[i])
    acc_train.append(accuracy_score(ytrain,ytrainPredicted_finalR_ThresholdH))
fig = px.line(x = thre,y=[acc_train,acc_test])
fig["data"][0]["name"] = "Accuracy Train"
fig["data"][1]["name"] = "Accuracy Test"
fig.show()
The best practical threshold for the Decision Tree model is 0.3888, where the model achieves the best combination of train accuracy () and the highest test accuracy within the defined range of thresholds.
ytestPredicted_dt_BT = (dt_reduced_dt.predict_proba(XtestR)[:, 1] > threshold_dt_reduced_dt[np.where(threshold_dt_reduced_dt>=0.3888)[0][-1]]).astype('float')
ytrainPredicted__dt_BT = (dt_reduced_dt.predict_proba(XtrainR)[:, 1] > threshold_dt_reduced_dt[np.where(threshold_dt_reduced_dt>=0.3888)[0][-1]]).astype('float')
threshold_dt_reduced_dt[np.where(threshold_dt_reduced_dt>=0.3888)[0][-1]]
0.3888888888888889
accuracy_score(y_true=ytest,y_pred=ytestPredicted_dt_BT)*100
88.44282238442823
confusion_matrix(y_true=ytest,y_pred=ytestPredicted_dt_BT)
array([[1938, 117],
[ 168, 243]], dtype=int64)
sns.heatmap(confusion_matrix(y_true=ytest,y_pred=ytestPredicted_dt_BT),annot=True,fmt='g')
plt.xlabel("Predictions")
plt.ylabel("Actual Values")
plt.title("Confusion Matrix for Reduced Decision Tree Model On Business Threshold")
plt.show()
So here,
0 = Negative (No Transaction) and 1 = Positive (Yes Transaction)
True Positives = 243 (customers correctly predicted to have made the transaction)
True Negatives = 1938 (customers correctly predicted not to have made the transaction)
False Positives = 117 (predicted YES Transaction, but the customer did not make the transaction)
False Negatives = 168 (predicted NO Transaction, but the customer actually made the transaction)
TP_dt_BT = confusion_matrix(y_true=ytest,y_pred=ytestPredicted_dt_BT)[1,1]
TN_dt_BT = confusion_matrix(y_true=ytest,y_pred=ytestPredicted_dt_BT)[0,0]
FP_dt_BT = confusion_matrix(y_true=ytest,y_pred=ytestPredicted_dt_BT)[0,1]
FN_dt_BT = confusion_matrix(y_true=ytest,y_pred=ytestPredicted_dt_BT)[1,0]
(TP_dt_BT + TN_dt_BT )/ (TP_dt_BT + TN_dt_BT +FP_dt_BT +FN_dt_BT)
0.8844282238442822
print(classification_report(ytest,ytestPredicted_dt_BT))
precision recall f1-score support
0 0.92 0.94 0.93 2055
1 0.68 0.59 0.63 411
accuracy 0.88 2466
macro avg 0.80 0.77 0.78 2466
weighted avg 0.88 0.88 0.88 2466
The recall for YES Transaction has settled at 0.59.
The F1-score for YES Transaction has held at 0.63.
Overall, this is the best recall and F1-score achievable with these parameters for the Decision Tree model.
Let's complete the model evaluation for the business-threshold reduced Decision Tree model.
fpr_dt_reduced_dt,tpr_dt_reduced_dt, threshold_dt_reduced_dt= roc_curve(ytest,dt_reduced_dt.predict_proba(XtestR)[:,1])
import matplotlib.pyplot as plt
plt.plot(fpr_dt_reduced_dt,tpr_dt_reduced_dt,label="Best FPR vs TPR")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for Reduced Best Threshold Decision Tree Model")
plt.scatter(fpr_dt_reduced_dt[ix_dt], tpr_dt_reduced_dt[ix_dt], marker='o', color='black', label='Best Threshold')
plt.scatter(fpr_dt_reduced_dt[np.where(threshold_dt_reduced_dt>=0.3888)[0][-1]], tpr_dt_reduced_dt[np.where(threshold_dt_reduced_dt>=0.3888)[0][-1]], marker='o', color='red', label='Business Threshold')
plt.plot([0,1], [0,1], linestyle='--', label='No Skill')
plt.legend()
plt.show()
roc_auc_score(ytest,ytestPredicted_dt_BT)
0.7671532846715329
Business Threshold Reduced Decision Tree Model
Overall, this means our reduced model classifies YES and NO Transactions in a balanced way, at a minimal cost of incorrectly predicting NO Transaction customers as YES Transaction.
| Comparison | Full DT | Reduced DT | Best Threshold - Reduced DT | Business Threshold - Reduced DT |
|---|---|---|---|---|
| Model Accuracy | 88.19% | 88.88% | 83.49% | 88.44% |
| False Negatives | 208 | 197 | 65 | 168 (moderate) |
| False Positives | 83 | 78 | 342 | 117 (moderate) |
| Precision (NO) | 0.90 | 0.91 | 0.96 | 0.92 (in control) |
| Recall (NO) | 0.96 | 0.96 | 0.83 | 0.94 (in control) |
| F1-Score (NO) | 0.93 | 0.93 | 0.89 | 0.93 |
| Precision (YES) | 0.71 | 0.73 | 0.50 | 0.68 (balanced) |
| Recall (YES) | 0.49 | 0.52 | 0.84 | 0.59 (balanced) |
| F1-Score (YES) | 0.58 | 0.61 | 0.63 | 0.63 |
| AUC Score | 0.7261 | 0.7413 | 0.8377 | 0.7671 (balanced) |
As the threshold of 0.3888 is acceptable:
Of the clients it flags as YES Transaction, the Decision Tree model is correct for roughly 6 to 7 out of every 10 (precision ≈ 0.68).
The model still makes moderate mistakes: it predicts NO Transaction customers as YES Transaction at a controllable rate.
It mis-classifies in both directions at a comparable rate:
• YES Transaction → NO Transaction, and
• NO Transaction → YES Transaction
The business threshold improves false-positive performance at the cost of false negatives, as mentioned earlier. This allows the business to target more customers who are on the verge of making a transaction. The cost is double-marketing to a similar number of customers who are already transacting; double marketing also has the potential to further improve sales from those customers through new promotions. Overall, the business would benefit more from the business-threshold model than from the best-threshold model.
| Comparison | Business Threshold - Reduced LogR Model | Business Threshold - Reduced KNN | Business Threshold - Reduced DT |
|---|---|---|---|
| Practical Threshold | 0.2208 | 0.3636 | 0.3888 |
| Model Accuracy | 88.36% | 87.95% | 88.44% |
| False Negatives | 164 | 185 | 168 |
| False Positives | 123 | 112 | 117 |
| Precision (NO) | 0.92 | 0.91 | 0.92 |
| Recall (NO) | 0.94 | 0.95 | 0.94 |
| F1-Score (NO) | 0.93 | 0.93 | 0.93 |
| Precision (YES) | 0.63 | 0.67 | 0.68 |
| Recall (YES) | 0.60 | 0.55 | 0.59 |
| F1-Score (YES) | 0.63 | 0.60 | 0.63 |
| AUC Score | 0.7705 | 0.7476 | 0.7671 (moderate) |
Conclusively, the reduced Decision Tree model with the business threshold is the optimal model for real-world use.
Correct classifications and errors are balanced, resulting in a win-win on both ends.
First, we will scale the data and try to find the optimal numbers of clusters.
df_kmeans = df.iloc[:,:13] # creating a separate dataframe for k-means clustering
df_kmeans
| Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | VisitorType Status | Weekend Status | Transaction Status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.200000 | 0.200000 | 0.000000 | 0.0 | 1 | 0 | 0 |
| 1 | 0 | 0.0 | 0 | 0.0 | 2 | 64.000000 | 0.000000 | 0.100000 | 0.000000 | 0.0 | 1 | 0 | 0 |
| 2 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.200000 | 0.200000 | 0.000000 | 0.0 | 1 | 0 | 0 |
| 3 | 0 | 0.0 | 0 | 0.0 | 2 | 2.666667 | 0.050000 | 0.140000 | 0.000000 | 0.0 | 1 | 0 | 0 |
| 4 | 0 | 0.0 | 0 | 0.0 | 10 | 627.500000 | 0.020000 | 0.050000 | 0.000000 | 0.0 | 1 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12325 | 3 | 145.0 | 0 | 0.0 | 53 | 1783.791667 | 0.007143 | 0.029031 | 12.241717 | 0.0 | 1 | 1 | 0 |
| 12326 | 0 | 0.0 | 0 | 0.0 | 5 | 465.750000 | 0.000000 | 0.021333 | 0.000000 | 0.0 | 1 | 1 | 0 |
| 12327 | 0 | 0.0 | 0 | 0.0 | 6 | 184.250000 | 0.083333 | 0.086667 | 0.000000 | 0.0 | 1 | 1 | 0 |
| 12328 | 4 | 75.0 | 0 | 0.0 | 15 | 346.000000 | 0.000000 | 0.021053 | 0.000000 | 0.0 | 1 | 0 | 0 |
| 12329 | 0 | 0.0 | 0 | 0.0 | 3 | 21.250000 | 0.000000 | 0.066667 | 0.000000 | 0.0 | 0 | 1 | 0 |
12330 rows × 13 columns
from sklearn.preprocessing import StandardScaler, normalize

scaler = StandardScaler()
scaler.fit(df_kmeans)
df_kmeans_Scaled = scaler.transform(df_kmeans)
df_kmeans_Scaled2 = normalize(df_kmeans)
Later we will compare the clusters formed using these 2 versions of scaled data.
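The two scalings behave quite differently: StandardScaler standardizes each column (zero mean, unit variance per feature), while normalize rescales each row to unit L2 norm. A small sketch of the contrast on a toy matrix:

```python
# StandardScaler works per column (zero mean, unit variance per feature),
# while normalize works per row (each sample rescaled to unit L2 norm) --
# so the two produce clusters with rather different geometry.
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

X = np.array([[1.0, 200.0], [2.0, 100.0], [3.0, 300.0]])

X_std = StandardScaler().fit_transform(X)
X_norm = normalize(X)                      # default norm='l2', axis=1 (rows)

print(X_std.mean(axis=0))                  # each column mean ~0
print(np.linalg.norm(X_norm, axis=1))      # each row norm == 1
```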
# Data scaled using Standard Scaler
from sklearn.cluster import KMeans

inertia = []
number_of_clusters = range(2,50)
for i in number_of_clusters:
    km = KMeans(n_clusters = i, n_init = 10, random_state = 0)  # n_init set explicitly to silence the FutureWarning
    km.fit(df_kmeans_Scaled)
    inertia.append(km.inertia_)
    print("Cluster for i = ", i, " Completed")
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
Cluster for i = 2 Completed
Cluster for i = 3 Completed
...
Cluster for i = 49 Completed
plt.plot(number_of_clusters, inertia)
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow Curve - Standard Scaled Data")
plt.show()
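Reading the elbow by eye can be supplemented with a simple heuristic: pick the k where the inertia curve's second difference (the change in slope) peaks. A sketch on synthetic blobs, not the project data, with illustrative names:

```python
# One simple, assumption-laden way to read an elbow off an inertia curve:
# take the k where the second difference of inertia (a curvature proxy)
# is largest. Demonstrated on synthetic blobs, not the report's dataset.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=600, centers=4, random_state=0)

ks = range(2, 10)
inertia = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
           for k in ks]

second_diff = np.diff(inertia, n=2)            # change in slope along the curve
elbow_k = list(ks)[int(np.argmax(second_diff)) + 1]
print("suggested elbow at k =", elbow_k)
```

This is only a heuristic; in practice it is worth confirming the suggestion against the plotted curve and a silhouette analysis.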
# Data scaled using normalize
inertia2 = []
number_of_clusters2 = range(2,50)
for i in number_of_clusters2:
    km2 = KMeans(n_clusters = i, n_init = 10, random_state = 0)  # n_init set explicitly to silence the FutureWarning
    km2.fit(df_kmeans_Scaled2)
    inertia2.append(km2.inertia_)
    print("Cluster for i = ", i, " Completed")
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
Cluster for i = 2 Completed
Cluster for i = 3 Completed
...
Cluster for i = 34 Completed
Cluster for i = 35 Completed
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
Cluster for i = 36 Completed
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
Cluster for i = 37 Completed
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
Cluster for i = 38 Completed
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
Cluster for i = 39 Completed
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
Cluster for i = 40 Completed
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
Cluster for i = 41 Completed
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
Cluster for i = 42 Completed
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
Cluster for i = 43 Completed
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
Cluster for i = 44 Completed
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
Cluster for i = 45 Completed
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
Cluster for i = 46 Completed
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
Cluster for i = 47 Completed
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
Cluster for i = 48 Completed
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
Cluster for i = 49 Completed
plt.plot(number_of_clusters2, inertia2)
plt.xlabel("Number of Clusters (n)")
plt.ylabel("Inertia")
plt.show()
As the two elbow charts above show, the optimal value of n for clustering is around 12-13 for the data scaled with Standard Scaler, and around 7-9 for the data scaled with Normalize.
However, we are not interested in creating 8-10 customer clusters. That would be counter-productive: with just over 12,000 visitors, targeting that many clusters can become impossible and/or ineffective.
As a solution to this problem, we will try n values ranging from 3 to 6 and identify the preferred way of clustering customers. Both scaled versions of the data will be used, creating two sets of clusters for each stated n value.
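The per-n clustering below can equivalently be written as a single loop. A minimal sketch, using synthetic stand-ins for the two scaled feature matrices (`df_kmeans_Scaled` and `df_kmeans_Scaled2` are defined earlier in the notebook; the random data here is only for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the two scaled feature matrices used in the report:
df_kmeans_Scaled = rng.normal(size=(200, 5))
df_kmeans_Scaled2 = rng.normal(size=(200, 5))
df = pd.DataFrame(index=range(200))

for n in range(3, 7):
    for data, suffix in ((df_kmeans_Scaled, ""), (df_kmeans_Scaled2, "_2")):
        # n_init is set explicitly to silence the sklearn FutureWarning
        km = KMeans(n_clusters=n, n_init=10, random_state=1)
        df[f"{n}CC{suffix}"] = km.fit_predict(data)

print(df.columns.tolist())
```

This produces the same eight cluster-label columns (`3CC` through `6CC_2`) without repeating the fit/predict boilerplate four times.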
# Clustering for n=3
km = KMeans(n_clusters=3, n_init=10, random_state=1)  # n_init set explicitly to silence the FutureWarning
km.fit(df_kmeans_Scaled)
y3 = km.predict(df_kmeans_Scaled)
df['3CC'] = y3

km2 = KMeans(n_clusters=3, n_init=10, random_state=1)
km2.fit(df_kmeans_Scaled2)
y3_2 = km2.predict(df_kmeans_Scaled2)
df['3CC_2'] = y3_2

# Clustering for n=4
km = KMeans(n_clusters=4, n_init=10, random_state=1)
km.fit(df_kmeans_Scaled)
y4 = km.predict(df_kmeans_Scaled)
df['4CC'] = y4

km2 = KMeans(n_clusters=4, n_init=10, random_state=1)
km2.fit(df_kmeans_Scaled2)
y4_2 = km2.predict(df_kmeans_Scaled2)
df['4CC_2'] = y4_2

# Clustering for n=5
km = KMeans(n_clusters=5, n_init=10, random_state=1)
km.fit(df_kmeans_Scaled)
y5 = km.predict(df_kmeans_Scaled)
df['5CC'] = y5

km2 = KMeans(n_clusters=5, n_init=10, random_state=1)
km2.fit(df_kmeans_Scaled2)
y5_2 = km2.predict(df_kmeans_Scaled2)
df['5CC_2'] = y5_2

# Clustering for n=6
km = KMeans(n_clusters=6, n_init=10, random_state=1)
km.fit(df_kmeans_Scaled)
y6 = km.predict(df_kmeans_Scaled)
df['6CC'] = y6

km2 = KMeans(n_clusters=6, n_init=10, random_state=1)
km2.fit(df_kmeans_Scaled2)
y6_2 = km2.predict(df_kmeans_Scaled2)
df['6CC_2'] = y6_2
Now, we will create pivot tables for these clusters and identify which version is more effective at grouping similar customers together.
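The same pivot-table construction is repeated below for every cluster column, so it can be factored into one helper. A minimal sketch; the `demo` frame is hypothetical stand-in data, while the column names and aggregations match those used throughout this section:

```python
import pandas as pd

def cluster_profile(df, cluster_col):
    """Per-cluster summary: visitor counts, transactions, and mean behaviour metrics."""
    table = pd.pivot_table(
        df, index=cluster_col,
        values=["Transaction", "Transaction Status", "VisitorType Status",
                "PageValues", "ExitRates", "Administrative_Duration",
                "Informational_Duration", "ProductRelated_Duration"],
        aggfunc={"Transaction": "count", "Transaction Status": "sum",
                 "VisitorType Status": "sum", "ExitRates": "mean",
                 "PageValues": "mean", "Administrative_Duration": "mean",
                 "Informational_Duration": "mean", "ProductRelated_Duration": "mean"},
    )
    cols = ["Transaction", "Transaction Status", "VisitorType Status", "PageValues",
            "ExitRates", "Administrative_Duration", "Informational_Duration",
            "ProductRelated_Duration"]
    return table[cols].rename(columns={
        "Transaction": "Count of Visitors",
        "Transaction Status": "Sum of Transactions Made",
        "VisitorType Status": "Count of Returning Visitors",
        "PageValues": "Average Page Values",
        "ExitRates": "Average Exit Rates",
        "Administrative_Duration": "Average Administrative_Duration",
        "Informational_Duration": "Average Informational_Duration",
        "ProductRelated_Duration": "Average ProductRelated_Duration",
    })

# Tiny synthetic example (hypothetical data, just to show the output shape):
demo = pd.DataFrame({
    "3CC": [0, 0, 1],
    "Transaction": ["No", "Yes", "No"],
    "Transaction Status": [0, 1, 0],
    "VisitorType Status": [1, 1, 0],
    "PageValues": [0.0, 10.0, 2.0],
    "ExitRates": [0.2, 0.05, 0.1],
    "Administrative_Duration": [0.0, 30.0, 5.0],
    "Informational_Duration": [0.0, 12.0, 0.0],
    "ProductRelated_Duration": [10.0, 500.0, 60.0],
})
profile = cluster_profile(demo, "3CC")
print(profile)
```

With this helper, each comparison below reduces to `cluster_profile(df, "3CC")`, `cluster_profile(df, "3CC_2")`, and so on.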
table = pd.pivot_table(df, index='3CC',values=['Transaction',"Transaction Status", "VisitorType Status","PageValues","ExitRates","Administrative_Duration","Informational_Duration","ProductRelated_Duration"],
aggfunc={"Transaction":"count","Transaction Status":"sum","VisitorType Status":"sum","ExitRates":"mean"
,"PageValues":"mean","Administrative_Duration":"mean","Informational_Duration":"mean","ProductRelated_Duration":"mean"})
table.columns
cols = [ 'Transaction', 'Transaction Status', 'VisitorType Status','PageValues', "ExitRates", 'Administrative_Duration', 'Informational_Duration','ProductRelated_Duration',]
table = table[cols]
table.rename(columns={'Transaction':'Count of Visitors',
'Transaction Status':'Sum of Transactions Made',
'VisitorType Status': 'Count of Returning Visitors',
'PageValues':'Average Page Values',
'ExitRates':'Average Exit Rates',
'Administrative_Duration':'Average Administrative_Duration',
'Informational_Duration':'Average Informational_Duration',
'ProductRelated_Duration':'Average ProductRelated_Duration',
},inplace=True)
table
| Count of Visitors | Sum of Transactions Made | Count of Returning Visitors | Average Page Values | Average Exit Rates | Average Administrative_Duration | Average Informational_Duration | Average ProductRelated_Duration | |
|---|---|---|---|---|---|---|---|---|
| 3CC | ||||||||
| 0 | 1054 | 5 | 1009 | 0.000000 | 0.180191 | 1.364769 | 0.086528 | 57.929836 |
| 1 | 9618 | 1387 | 7971 | 5.782345 | 0.032114 | 50.764207 | 8.358147 | 864.859889 |
| 2 | 1658 | 516 | 1571 | 10.253288 | 0.019477 | 305.672409 | 207.819548 | 3831.085905 |
table2 = pd.pivot_table(df, index='3CC_2',values=['Transaction',"Transaction Status", "VisitorType Status","PageValues","ExitRates","Administrative_Duration","Informational_Duration","ProductRelated_Duration"],
aggfunc={"Transaction":"count","Transaction Status":"sum","VisitorType Status":"sum","ExitRates":"mean"
,"PageValues":"mean","Administrative_Duration":"mean","Informational_Duration":"mean","ProductRelated_Duration":"mean"})
table2.columns
cols = [ 'Transaction', 'Transaction Status', 'VisitorType Status','PageValues', "ExitRates", 'Administrative_Duration', 'Informational_Duration','ProductRelated_Duration',]
table2 = table2[cols]
table2.rename(columns={'Transaction':'Count of Visitors',
'Transaction Status':'Sum of Transactions Made',
'VisitorType Status': 'Count of Returning Visitors',
'PageValues':'Average Page Values',
'ExitRates':'Average Exit Rates',
'Administrative_Duration':'Average Administrative_Duration',
'Informational_Duration':'Average Informational_Duration',
'ProductRelated_Duration':'Average ProductRelated_Duration',
},inplace=True)
table2
| Count of Visitors | Sum of Transactions Made | Count of Returning Visitors | Average Page Values | Average Exit Rates | Average Administrative_Duration | Average Informational_Duration | Average ProductRelated_Duration | |
|---|---|---|---|---|---|---|---|---|
| 3CC_2 | ||||||||
| 0 | 10564 | 1754 | 9229 | 6.493919 | 0.033730 | 66.577330 | 35.752847 | 1361.484717 |
| 1 | 728 | 3 | 682 | 0.000000 | 0.198339 | 0.000000 | 0.015110 | 0.023123 |
| 2 | 1038 | 151 | 640 | 3.865882 | 0.029265 | 282.437916 | 45.607501 | 335.722073 |
table = pd.pivot_table(df, index='4CC',values=['Transaction',"Transaction Status", "VisitorType Status","PageValues","ExitRates","Administrative_Duration","Informational_Duration","ProductRelated_Duration"],
aggfunc={"Transaction":"count","Transaction Status":"sum","VisitorType Status":"sum","ExitRates":"mean"
,"PageValues":"mean","Administrative_Duration":"mean","Informational_Duration":"mean","ProductRelated_Duration":"mean"})
table.columns
cols = [ 'Transaction', 'Transaction Status', 'VisitorType Status','PageValues', "ExitRates", 'Administrative_Duration', 'Informational_Duration','ProductRelated_Duration',]
table = table[cols]
table.rename(columns={'Transaction':'Count of Visitors',
'Transaction Status':'Sum of Transactions Made',
'VisitorType Status': 'Count of Returning Visitors',
'PageValues':'Average Page Values',
'ExitRates':'Average Exit Rates',
'Administrative_Duration':'Average Administrative_Duration',
'Informational_Duration':'Average Informational_Duration',
'ProductRelated_Duration':'Average ProductRelated_Duration',
},inplace=True)
table
| Count of Visitors | Sum of Transactions Made | Count of Returning Visitors | Average Page Values | Average Exit Rates | Average Administrative_Duration | Average Informational_Duration | Average ProductRelated_Duration | |
|---|---|---|---|---|---|---|---|---|
| 4CC | ||||||||
| 0 | 1674 | 1616 | 1229 | 31.885464 | 0.019398 | 85.637681 | 19.275564 | 1297.969759 |
| 1 | 8347 | 0 | 7104 | 1.247085 | 0.034125 | 51.673536 | 9.043947 | 852.938312 |
| 2 | 1041 | 5 | 996 | 0.000000 | 0.181073 | 1.381812 | 0.087608 | 55.359891 |
| 3 | 1268 | 287 | 1222 | 6.962825 | 0.019936 | 331.528802 | 250.154847 | 4243.938312 |
table2 = pd.pivot_table(df, index='4CC_2',values=['Transaction',"Transaction Status", "VisitorType Status","PageValues","ExitRates","Administrative_Duration","Informational_Duration","ProductRelated_Duration"],
aggfunc={"Transaction":"count","Transaction Status":"sum","VisitorType Status":"sum","ExitRates":"mean"
,"PageValues":"mean","Administrative_Duration":"mean","Informational_Duration":"mean","ProductRelated_Duration":"mean"})
table2.columns
cols = [ 'Transaction', 'Transaction Status', 'VisitorType Status','PageValues', "ExitRates", 'Administrative_Duration', 'Informational_Duration','ProductRelated_Duration',]
table2 = table2[cols]
table2.rename(columns={'Transaction':'Count of Visitors',
'Transaction Status':'Sum of Transactions Made',
'VisitorType Status': 'Count of Returning Visitors',
'PageValues':'Average Page Values',
'ExitRates':'Average Exit Rates',
'Administrative_Duration':'Average Administrative_Duration',
'Informational_Duration':'Average Informational_Duration',
'ProductRelated_Duration':'Average ProductRelated_Duration',
},inplace=True)
table2
| Count of Visitors | Sum of Transactions Made | Count of Returning Visitors | Average Page Values | Average Exit Rates | Average Administrative_Duration | Average Informational_Duration | Average ProductRelated_Duration | |
|---|---|---|---|---|---|---|---|---|
| 4CC_2 | ||||||||
| 0 | 1538 | 250 | 1124 | 6.910813 | 0.023658 | 238.192616 | 92.105426 | 698.879555 |
| 1 | 728 | 3 | 682 | 0.000000 | 0.198339 | 0.000000 | 0.015110 | 0.023123 |
| 2 | 9532 | 1581 | 8459 | 6.373509 | 0.034920 | 50.198511 | 28.782988 | 1422.055295 |
| 3 | 532 | 74 | 286 | 2.318482 | 0.032813 | 285.077097 | 16.947521 | 190.406452 |
table = pd.pivot_table(df, index='5CC',values=['Transaction',"Transaction Status", "VisitorType Status","PageValues","ExitRates","Administrative_Duration","Informational_Duration","ProductRelated_Duration"],
aggfunc={"Transaction":"count","Transaction Status":"sum","VisitorType Status":"sum","ExitRates":"mean"
,"PageValues":"mean","Administrative_Duration":"mean","Informational_Duration":"mean","ProductRelated_Duration":"mean"})
table.columns
cols = [ 'Transaction', 'Transaction Status', 'VisitorType Status','PageValues', "ExitRates", 'Administrative_Duration', 'Informational_Duration','ProductRelated_Duration',]
table = table[cols]
table.rename(columns={'Transaction':'Count of Visitors',
'Transaction Status':'Sum of Transactions Made',
'VisitorType Status': 'Count of Returning Visitors',
'PageValues':'Average Page Values',
'ExitRates':'Average Exit Rates',
'Administrative_Duration':'Average Administrative_Duration',
'Informational_Duration':'Average Informational_Duration',
'ProductRelated_Duration':'Average ProductRelated_Duration',
},inplace=True)
table
| Count of Visitors | Sum of Transactions Made | Count of Returning Visitors | Average Page Values | Average Exit Rates | Average Administrative_Duration | Average Informational_Duration | Average ProductRelated_Duration | |
|---|---|---|---|---|---|---|---|---|
| 5CC | ||||||||
| 0 | 915 | 32 | 882 | 1.360801 | 0.050171 | 30.120709 | 9.616411 | 907.302773 |
| 1 | 7682 | 0 | 6466 | 1.321865 | 0.033996 | 55.122319 | 9.560449 | 848.208278 |
| 2 | 892 | 4 | 847 | 0.000000 | 0.188856 | 1.095889 | 0.000000 | 40.252108 |
| 3 | 1195 | 282 | 1152 | 7.030489 | 0.019867 | 336.109519 | 260.538065 | 4345.424080 |
| 4 | 1646 | 1590 | 1204 | 32.085915 | 0.019334 | 86.789160 | 19.112572 | 1310.094901 |
table2 = pd.pivot_table(df, index='5CC_2',values=['Transaction',"Transaction Status", "VisitorType Status","PageValues","ExitRates","Administrative_Duration","Informational_Duration","ProductRelated_Duration"],
aggfunc={"Transaction":"count","Transaction Status":"sum","VisitorType Status":"sum","ExitRates":"mean"
,"PageValues":"mean","Administrative_Duration":"mean","Informational_Duration":"mean","ProductRelated_Duration":"mean"})
table2.columns
cols = [ 'Transaction', 'Transaction Status', 'VisitorType Status','PageValues', "ExitRates", 'Administrative_Duration', 'Informational_Duration','ProductRelated_Duration',]
table2 = table2[cols]
table2.rename(columns={'Transaction':'Count of Visitors',
'Transaction Status':'Sum of Transactions Made',
'VisitorType Status': 'Count of Returning Visitors',
'PageValues':'Average Page Values',
'ExitRates':'Average Exit Rates',
'Administrative_Duration':'Average Administrative_Duration',
'Informational_Duration':'Average Informational_Duration',
'ProductRelated_Duration':'Average ProductRelated_Duration',
},inplace=True)
table2
| Count of Visitors | Sum of Transactions Made | Count of Returning Visitors | Average Page Values | Average Exit Rates | Average Administrative_Duration | Average Informational_Duration | Average ProductRelated_Duration | |
|---|---|---|---|---|---|---|---|---|
| 5CC_2 | ||||||||
| 0 | 1404 | 230 | 1016 | 6.904934 | 0.022771 | 256.671525 | 26.727277 | 712.075199 |
| 1 | 727 | 3 | 681 | 0.000000 | 0.198429 | 0.000000 | 0.000000 | 0.023155 |
| 2 | 9464 | 1570 | 8402 | 6.392799 | 0.035003 | 49.647374 | 28.409172 | 1425.782218 |
| 3 | 506 | 71 | 266 | 2.319061 | 0.033083 | 290.290851 | 13.119330 | 190.904611 |
| 4 | 229 | 34 | 186 | 5.437239 | 0.029903 | 84.614537 | 489.156253 | 616.811510 |
table = pd.pivot_table(df, index='6CC',values=['Transaction',"Transaction Status", "VisitorType Status","PageValues","ExitRates","Administrative_Duration","Informational_Duration","ProductRelated_Duration"],
aggfunc={"Transaction":"count","Transaction Status":"sum","VisitorType Status":"sum","ExitRates":"mean"
,"PageValues":"mean","Administrative_Duration":"mean","Informational_Duration":"mean","ProductRelated_Duration":"mean"})
table.columns
cols = [ 'Transaction', 'Transaction Status', 'VisitorType Status','PageValues', "ExitRates", 'Administrative_Duration', 'Informational_Duration','ProductRelated_Duration',]
table = table[cols]
table.rename(columns={'Transaction':'Count of Visitors',
'Transaction Status':'Sum of Transactions Made',
'VisitorType Status': 'Count of Returning Visitors',
'PageValues':'Average Page Values',
'ExitRates':'Average Exit Rates',
'Administrative_Duration':'Average Administrative_Duration',
'Informational_Duration':'Average Informational_Duration',
'ProductRelated_Duration':'Average ProductRelated_Duration',
},inplace=True)
table
| Count of Visitors | Sum of Transactions Made | Count of Returning Visitors | Average Page Values | Average Exit Rates | Average Administrative_Duration | Average Informational_Duration | Average ProductRelated_Duration | |
|---|---|---|---|---|---|---|---|---|
| 6CC | ||||||||
| 0 | 1099 | 276 | 1066 | 7.213116 | 0.019951 | 336.974094 | 275.931679 | 4529.184175 |
| 1 | 1896 | 2 | 1539 | 1.384727 | 0.031626 | 64.687274 | 13.379048 | 892.093766 |
| 2 | 5901 | 0 | 5030 | 1.465393 | 0.034663 | 56.116772 | 9.443190 | 853.779068 |
| 3 | 1639 | 1594 | 1199 | 31.840776 | 0.019330 | 87.664742 | 19.404694 | 1316.702720 |
| 4 | 915 | 32 | 882 | 1.341616 | 0.050186 | 30.280599 | 9.727340 | 908.738671 |
| 5 | 880 | 4 | 835 | 0.000000 | 0.189832 | 1.110833 | 0.000000 | 39.222477 |
table2 = pd.pivot_table(df, index='6CC_2',values=['Transaction',"Transaction Status", "VisitorType Status","PageValues","ExitRates","Administrative_Duration","Informational_Duration","ProductRelated_Duration"],
aggfunc={"Transaction":"count","Transaction Status":"sum","VisitorType Status":"sum","ExitRates":"mean"
,"PageValues":"mean","Administrative_Duration":"mean","Informational_Duration":"mean","ProductRelated_Duration":"mean"})
table2.columns
cols = [ 'Transaction', 'Transaction Status', 'VisitorType Status','PageValues', "ExitRates", 'Administrative_Duration', 'Informational_Duration','ProductRelated_Duration',]
table2 = table2[cols]
table2.rename(columns={'Transaction':'Count of Visitors',
'Transaction Status':'Sum of Transactions Made',
'VisitorType Status': 'Count of Returning Visitors',
'PageValues':'Average Page Values',
'ExitRates':'Average Exit Rates',
'Administrative_Duration':'Average Administrative_Duration',
'Informational_Duration':'Average Informational_Duration',
'ProductRelated_Duration':'Average ProductRelated_Duration',
},inplace=True)
table2
| Count of Visitors | Sum of Transactions Made | Count of Returning Visitors | Average Page Values | Average Exit Rates | Average Administrative_Duration | Average Informational_Duration | Average ProductRelated_Duration | |
|---|---|---|---|---|---|---|---|---|
| 6CC_2 | ||||||||
| 0 | 9461 | 1570 | 8400 | 6.394826 | 0.035007 | 49.593859 | 28.413987 | 1425.895109 |
| 1 | 127 | 0 | 119 | 0.000000 | 0.197900 | 0.000000 | 0.000000 | 0.000000 |
| 2 | 508 | 72 | 267 | 2.309930 | 0.033013 | 289.714214 | 13.067679 | 190.813615 |
| 3 | 600 | 3 | 562 | 0.000000 | 0.198540 | 0.000000 | 0.000000 | 0.028056 |
| 4 | 1405 | 229 | 1017 | 6.900019 | 0.022785 | 256.750478 | 26.736496 | 713.613722 |
| 5 | 229 | 34 | 186 | 5.437239 | 0.029903 | 84.614537 | 489.156253 | 616.811510 |
As can be seen from the pivot tables above, for n = 5 and 6 the originally larger clusters split into two smaller clusters each. This increases the number of classes but dilutes the focus of our marketing strategy. Because of this, we will use a model with n = 3 or 4.
Between n = 3 and 4, n = 4 seems to cluster clients into proper segments, since each segment is clearly differentiated from the others in its average attribute values.
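The visual comparison above can also be complemented with a quantitative check such as the silhouette score (not part of the original analysis). A sketch on synthetic data, since the actual scaled matrix is defined earlier in the notebook:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled feature matrix: 4 well-separated groups.
X, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.60, random_state=0)

scores = {}
for n in range(3, 7):
    labels = KMeans(n_clusters=n, n_init=10, random_state=1).fit_predict(X)
    # Silhouette ranges from -1 (poor) to +1 (dense, well-separated clusters).
    scores[n] = silhouette_score(X, labels)

best_n = max(scores, key=scores.get)
print(scores, "-> best n:", best_n)
```

On the real scaled data, the n with the highest silhouette score would provide an objective tiebreaker between n = 3 and n = 4.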
table = pd.pivot_table(df, index='4CC',values=['Transaction',"Transaction Status", "VisitorType Status","PageValues","ExitRates","Administrative_Duration","Informational_Duration","ProductRelated_Duration"],
aggfunc={"Transaction":"count","Transaction Status":"sum","VisitorType Status":"sum","ExitRates":"mean"
,"PageValues":"mean","Administrative_Duration":"mean","Informational_Duration":"mean","ProductRelated_Duration":"mean"})
table.columns
cols = [ 'Transaction', 'Transaction Status', 'VisitorType Status','PageValues', "ExitRates", 'Administrative_Duration', 'Informational_Duration','ProductRelated_Duration',]
table = table[cols]
table.rename(columns={'Transaction':'Count of Visitors',
'Transaction Status':'Sum of Transactions Made',
'VisitorType Status': 'Count of Returning Visitors',
'PageValues':'Average Page Values',
'ExitRates':'Average Exit Rates',
'Administrative_Duration':'Average Administrative_Duration',
'Informational_Duration':'Average Informational_Duration',
'ProductRelated_Duration':'Average ProductRelated_Duration',
},inplace=True)
table
| Count of Visitors | Sum of Transactions Made | Count of Returning Visitors | Average Page Values | Average Exit Rates | Average Administrative_Duration | Average Informational_Duration | Average ProductRelated_Duration | |
|---|---|---|---|---|---|---|---|---|
| 4CC | ||||||||
| 0 | 1674 | 1616 | 1229 | 31.885464 | 0.019398 | 85.637681 | 19.275564 | 1297.969759 |
| 1 | 8347 | 0 | 7104 | 1.247085 | 0.034125 | 51.673536 | 9.043947 | 852.938312 |
| 2 | 1041 | 5 | 996 | 0.000000 | 0.181073 | 1.381812 | 0.087608 | 55.359891 |
| 3 | 1268 | 287 | 1222 | 6.962825 | 0.019936 | 331.528802 | 250.154847 | 4243.938312 |
table.reset_index(inplace=True)
table.iloc[:,[0,1,2,3,4,6,7,8]]
| 4CC | Count of Visitors | Sum of Transactions Made | Count of Returning Visitors | Average Page Values | Average Administrative_Duration | Average Informational_Duration | Average ProductRelated_Duration | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1674 | 1616 | 1229 | 31.885464 | 85.637681 | 19.275564 | 1297.969759 |
| 1 | 1 | 8347 | 0 | 7104 | 1.247085 | 51.673536 | 9.043947 | 852.938312 |
| 2 | 2 | 1041 | 5 | 996 | 0.000000 | 1.381812 | 0.087608 | 55.359891 |
| 3 | 3 | 1268 | 287 | 1222 | 6.962825 | 331.528802 | 250.154847 | 4243.938312 |
We have 4 unique segments after the KMeans clustering.
It is evident from the table that Cluster 0 captures our Loyal Shoppers: 1,674 visitors accounting for 1,616 transactions, with by far the highest average Page Values. These are the visitors who come to buy.
It is evident from the table that Cluster 1 is the largest segment, with 8,347 visitors and zero transactions. This is a clear classification of Deal Seekers who visit Clifford's website and spend some time looking for their product and price, but leave without making a transaction.
It is evident from the table that Cluster 2 consists of 1,041 visitors with near-zero durations on every page type and the highest average exit rates. This class could be classified as Customers on the EDGE, who can be converted if an adequate strategy is adopted to compel them to make a transaction.
It is evident from the table that Cluster 3 spends far more time than any other segment on administrative, informational, and product-related pages. This is a clear classification of Confused Visitors who visit Clifford's website and spend a long time looking for their product, price, information, usage, and reviews, but only a few of them (287 of 1,268) make a transaction and buy the product.
Based on this analysis of the customer segments and the chosen ML model, the following marketing strategies are recommended for Clifford's eCommerce website.
(1) Loyalty Program: Clifford's can establish a loyalty program where customers receive points in return for shopping at Clifford's. The number of points can vary by segment, marketing campaign, etc. This will help Clifford's build loyal customers over time, similar to what we see in the Loyal Shoppers cluster.
(2) Targeted Promotions for the Deal Seekers and Confused Shoppers Segments: Customers who are in the Deal Seekers cluster, have not made any purchase, and are predicted as "No transaction" will be targeted here, improving the conversion rate for those customers. Customers from the Confused Shoppers group can also be targeted this way to motivate them to make a transaction.
(3) Specific Website Redesign: We also recommend that Clifford's update/redesign certain pages on the website, mainly the administrative and informational ones. This primarily targets the Confused Shoppers segment and will help them find what they are looking for faster, in turn increasing the conversion rate of visitors. The redesign will benefit the other segments as well, but the Confused segment is affected the most.
(4) Email/Text Reminders: Such reminders will alert customers about specific aspects of their Clifford's eCommerce accounts, such as an expiring promotion, a new deal, or an item left in the cart. The trigger actions can be defined based on the available budget and the desired contact frequency from the customer's perspective.
On top of this, we also recommend that this ML prediction model and the clusters be retrained every quarter with new user data to potentially improve their accuracy.
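The quarterly refresh could be packaged as a small routine. A minimal sketch under stated assumptions: the new quarter's sessions arrive as a numeric feature matrix like the one clustered above, and all names here (`quarterly_refresh`, `new_quarter`) are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def quarterly_refresh(features, n_clusters=4, random_state=1):
    """Refit the scaler and the KMeans model on the latest quarter's data.

    Returns the fitted scaler, the fitted model, and the new cluster labels,
    so the pivot tables and marketing segments can be rebuilt downstream.
    """
    scaler = StandardScaler().fit(features)
    scaled = scaler.transform(features)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(scaled)
    return scaler, km, km.labels_

# Hypothetical new-quarter data (rows = sessions, columns = behavioural features):
rng = np.random.default_rng(42)
new_quarter = rng.normal(size=(500, 8))
scaler, model, labels = quarterly_refresh(new_quarter)
print(len(labels), "sessions assigned to", len(set(labels)), "segments")
```

Refitting both the scaler and the model each quarter keeps the segments aligned with current visitor behaviour rather than the original training snapshot.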